Have you run EXPLAIN for each query? If you look at the physical query plan,
it's most likely the same in both cases. If the inner query's table is small
enough, it should end up as a broadcast join.
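For instance, assuming A and B are registered as temporary views, a quick way
to compare the two physical plans from PySpark:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-check").getOrCreate()

# IN-subquery version: Catalyst typically rewrites this into a left semi join
spark.sql(
    "SELECT * FROM A WHERE join_key IN (SELECT join_key FROM B)"
).explain()

# Explicit join version
spark.sql(
    "SELECT A.* FROM A INNER JOIN B ON A.join_key = B.join_key"
).explain()

# If B fits under spark.sql.autoBroadcastJoinThreshold (10 MB by default),
# both plans should show a broadcast join.
```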
From: Jayesh Lalwani
Date: Wednesday, October 17, 2018 at 5:03 PM
To: "user@spark.apache.org"
Subject: Is there a significant difference in how an IN clause performs when
compared to a JOIN?
Let's say I have 2 tables, A and B. B has 50 million rows and A has 1 million.
Will this query:
Select * from A where join_key in (Select join_key from B)
perform much worse than
Select * from A INNER JOIN B ON A.join_key = B.join_key
Hi,
`pyspark.sql.SparkSession.builder.getOrCreate()` gives me an error, and I
wonder if anyone can help me with this.
The line of code that gives me an error is
```
with spark_session(master, app_name) as session:
```
where spark_session is a Python context manager.
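A minimal sketch of that context manager, assuming it simply wraps
`SparkSession.builder.getOrCreate()` and stops the session on exit:
```
from contextlib import contextmanager
from pyspark.sql import SparkSession

@contextmanager
def spark_session(master, app_name):
    # Build (or reuse) a session for the given master URL and app name.
    session = (
        SparkSession.builder
        .master(master)
        .appName(app_name)
        .getOrCreate()
    )
    try:
        yield session
    finally:
        session.stop()
```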
What are the steps to configure this? Thanks
On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester
wrote:
> Hi,
> I failed to configure Spark for in-memory shuffle, so currently I'm just
> using a Linux memory-mapped directory (tmpfs) as the working directory of
> Spark, so everything is fast.
>
> Sent using Zoho Mail
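For what it's worth, a minimal sketch of that setup, assuming a local or
standalone deployment and an existing tmpfs mount at /mnt/tmpfs (cluster
managers usually override this via SPARK_LOCAL_DIRS or their own local-dir
settings):
```
from pyspark.sql import SparkSession

# Point Spark's scratch space (shuffle spills, temp files) at a tmpfs
# mount so shuffle data never touches a regular disk.
spark = (
    SparkSession.builder
    .appName("tmpfs-shuffle")
    .config("spark.local.dir", "/mnt/tmpfs")
    .getOrCreate()
)
```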
super duper, I also need to try this out.
On Wed, Oct 17, 2018 at 2:39 PM onmstester onmstester
wrote:
> Hi,
> I failed to configure Spark for in-memory shuffle, so currently I'm just
> using a Linux memory-mapped directory (tmpfs) as the working directory of
> Spark, so everything is fast.
>
> Sent using Zoho Mail
Hi,
I failed to configure Spark for in-memory shuffle, so currently I'm just using
a Linux memory-mapped directory (tmpfs) as the working directory of Spark, so
everything is fast.

Sent using Zoho Mail

On Wed, 17 Oct 2018 16:41:32 +0330, thomas lavocat wrote:
> Hi everyone,
> The possibility to have in memory shuffling is discussed in this issue
> https://github.com/apache/spark/pull/5403.
Hello Group,
I am having issues setting the stripe size, index stride, and index on an ORC
file using PySpark. I am getting approximately 2000 stripes for the 1.2 GB
file when I am expecting only 5 stripes for the 256 MB setting.
I tried the options below:
1. Set the .options on the DataFrame writer. The
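For reference, the writer-options attempt presumably looked something like the
sketch below; the keys are standard ORC writer properties, but whether Spark
forwards them to the ORC writer depends on the Spark version and on
spark.sql.orc.impl ("native" vs "hive"):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-stripes").getOrCreate()
df = spark.range(1000000)  # placeholder DataFrame

# Standard ORC writer properties, passed through the DataFrameWriter;
# values are strings, sizes in bytes.
(df.write
    .option("orc.stripe.size", "268435456")    # 256 MB stripes
    .option("orc.row.index.stride", "10000")   # rows per index entry
    .option("orc.create.index", "true")        # write row-group indexes
    .mode("overwrite")
    .orc("/tmp/orc_stripe_test"))
```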
Hi everyone,
The possibility to have in memory shuffling is discussed in this issue
https://github.com/apache/spark/pull/5403. It was in 2015.
In 2016 the paper "Scaling Spark on HPC Systems" says that Spark still
shuffles using disks. I would like to know:
What is the current state of
Hi,
I think that the speed of ORC has been improved in the latest versions. Any
chance you could use the latest version?
Regards,
Gourav Sengupta
On 17 Oct 2018 6:11 am, "daily" wrote:
Hi,
Spark version: 2.3.0
Hive version: 2.1.0
Best regards.
-- Original Message --