Re: performance of IN clause

2018-10-17 Thread Silvio Fiorito
Have you run explain for each query? If you look at the physical query plan, it's most likely the same. If the inner query/join table is small enough, it should end up as a broadcast join. From: Jayesh Lalwani Date: Wednesday, October 17, 2018 at 5:03 PM To: "user@spark.apache.org" Subject:
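A minimal PySpark sketch of the comparison Silvio suggests; the table names and the left-semi-join rewrite follow Jayesh's question below, but the session setup is an illustrative assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-vs-join-plans").getOrCreate()

# IN-subquery form of the query.
spark.sql("""
    SELECT * FROM A
    WHERE join_key IN (SELECT join_key FROM B)
""").explain()

# Join form; a LEFT SEMI JOIN has the same semantics as the IN subquery.
spark.sql("""
    SELECT * FROM A
    LEFT SEMI JOIN B ON A.join_key = B.join_key
""").explain()

# If either physical plan shows a BroadcastHashJoin, Spark judged the
# smaller side to fit under spark.sql.autoBroadcastJoinThreshold and
# broadcast it to every executor.
```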

performance of IN clause

2018-10-17 Thread Jayesh Lalwani
Is there a significant difference in how an IN clause performs when compared to a JOIN? Let's say I have 2 tables, A and B. B has 50 million rows and A has 1 million. Will this query: *Select * from A where join_key in (Select join_key from B)* perform much worse than *Select * from A INNER

[PySpark SQL]: SparkConf does not exist in the JVM

2018-10-17 Thread takao
Hi, `pyspark.sql.SparkSession.builder.getOrCreate()` gives me an error, and I wonder if anyone can help me with this. The line of code that gives me the error is ``` with spark_session(master, app_name) as session: ``` where spark_session is a Python context manager: ```
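The snippet is truncated in the digest; below is a hedged sketch of what a spark_session context manager could look like, reconstructed only from the quoted call site — the body is an assumption, not takao's actual code:

```python
from contextlib import contextmanager

from pyspark.sql import SparkSession

@contextmanager
def spark_session(master, app_name):
    # Hypothetical helper matching the quoted call
    # `with spark_session(master, app_name) as session:`.
    session = (SparkSession.builder
               .master(master)
               .appName(app_name)
               .getOrCreate())
    try:
        yield session
    finally:
        session.stop()
```

For what it's worth, the "does not exist in the JVM" family of Py4J errors often points at a version mismatch between the pyspark package and the Spark installation on the path, rather than at code like this.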

Re: Spark In Memory Shuffle

2018-10-17 Thread ☼ R Nair
What are the steps to configure this? Thanks On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester wrote: > Hi, > I failed to configure Spark for in-memory shuffle, so currently I'm just > using a Linux memory-mapped directory (tmpfs) as Spark's working directory, > so everything is fast > > Sent using

Re: Spark In Memory Shuffle

2018-10-17 Thread Gourav Sengupta
Super duper, I also need to try this out. On Wed, Oct 17, 2018 at 2:39 PM onmstester onmstester wrote: > Hi, > I failed to configure Spark for in-memory shuffle, so currently I'm just > using a Linux memory-mapped directory (tmpfs) as Spark's working directory, > so everything is fast > > Sent using

Re: Spark In Memory Shuffle

2018-10-17 Thread onmstester onmstester
Hi, I failed to configure Spark for in-memory shuffle, so currently I'm just using a Linux memory-mapped directory (tmpfs) as Spark's working directory, so everything is fast. Sent using Zoho Mail On Wed, 17 Oct 2018 16:41:32 +0330 thomas lavocat wrote Hi everyone, The possibility to have in
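For R Nair's question about configuration, a sketch of the tmpfs approach described here — the mount point /mnt/spark-tmpfs and its size are illustrative assumptions:

```python
# Assumes a tmpfs mount already exists on each node, e.g. (as root):
#   mount -t tmpfs -o size=64g tmpfs /mnt/spark-tmpfs
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tmpfs-scratch")
         # Point Spark's scratch space (shuffle spills, temp files) at the
         # tmpfs mount. On a cluster this is usually set per worker, via
         # SPARK_LOCAL_DIRS or spark-defaults.conf, not in application code.
         .config("spark.local.dir", "/mnt/spark-tmpfs")
         .getOrCreate())
```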

FW: Pyspark: set Orc Stripe.size on dataframe writer issue

2018-10-17 Thread Somasundara, Ashwin
Hello Group, I am having issues setting the stripe size, index stride, and index on an ORC file using PySpark. I am getting approximately 2,000 stripes for the 1.2GB file when I am expecting only 5 stripes for the 256MB setting. Tried the below options: 1. Set the .options on the data frame writer. The
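A sketch of that first attempt, passing ORC settings through the DataFrame writer — the keys are standard ORC writer property names and the values match the sizes mentioned above, but whether Spark 2.x actually forwards them to the underlying writer is exactly what's in question here:

```python
# df is a hypothetical DataFrame to be written as ORC.
(df.write
   .format("orc")
   .option("orc.stripe.size", 268435456)   # 256 MB stripes
   .option("orc.row.index.stride", 10000)  # rows per index entry
   .option("orc.create.index", "true")     # build row group indexes
   .save("/tmp/orc_out"))
```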

Spark In Memory Shuffle

2018-10-17 Thread thomas lavocat
Hi everyone, The possibility of having in-memory shuffling is discussed in this issue: https://github.com/apache/spark/pull/5403. That was in 2015. In 2016, the paper "Scaling Spark on HPC Systems" says that Spark still shuffles using disks. I would like to know: What is the current state of

Re: SparkSQL read Hive transactional table

2018-10-17 Thread Gourav Sengupta
Hi, I think that the speed of ORC has been improved in the latest versions. Any chance you could use the latest version? Regards, Gourav Sengupta On 17 Oct 2018 6:11 am, "daily" wrote: Hi, Spark version: 2.3.0 Hive version: 2.1.0 Best regards. -- Original Message --
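For context, a minimal sketch of reading a Hive table from Spark SQL with Hive support enabled — the table name is hypothetical, and note that Spark's built-in reader has historically had limited support for Hive transactional (ACID) tables, which appears to be the subject of this thread:

```python
from pyspark.sql import SparkSession

# Hive support must be enabled for Spark SQL to see the Hive metastore.
spark = (SparkSession.builder
         .appName("read-hive-table")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical transactional table.
spark.sql("SELECT * FROM mydb.some_acid_table").show()
```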