[jira] [Created] (SPARK-6072) Enable hash joins for nullable columns

2015-02-27 Thread Dima Zhiyanov (JIRA)
Dima Zhiyanov created SPARK-6072:


 Summary: Enable hash joins for nullable columns
 Key: SPARK-6072
 URL: https://issues.apache.org/jira/browse/SPARK-6072
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: Dima Zhiyanov


Currently, joins such as

A join B ON A.x = B.x AND A.y = B.y

(where y is nullable) are evaluated as a hash join on x alone, followed by a
filter on y. This causes a skew problem (a very long-running join) when a
particular value of x has high cardinality, even though (x, y) is evenly
distributed.

Can we implement it as a hash join on (x, Option(y))? This would eliminate
the skew in this case.

Imagine a join:

People AS p1 JOIN People AS p2 ON p1.name = p2.name AND p1.address = p2.address

where only a very small percentage of people have an unknown (null) address.

This causes a skewed join on popular names such as Mary Brown if we hash on
name alone, but it will not cause a skew if we hash on (name, Option(address)).
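
A minimal sketch of the idea at the RDD level (illustrative only: the Person
class and helper functions below are not the proposed planner change):

    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark
    import org.apache.spark.rdd.RDD

    case class Person(name: String, address: String) // address may be null

    // Key each row by (name, Option(address)). Option(null) is None, so a
    // null address becomes a legitimate key value instead of forcing the
    // hash onto name alone.
    def keyed(people: RDD[Person]): RDD[((String, Option[String]), Person)] =
      people.map(p => ((p.name, Option(p.address)), p))

    // Hash join on the composite key: rows that share a popular name but
    // hold different addresses now hash to different partitions, removing
    // the skew. Note that under SQL semantics null = null is not true, so
    // pairs matched through a None key still need a residual filter.
    def selfJoin(people: RDD[Person]) = keyed(people).join(keyed(people))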







DataFrame: Enable zipWithUniqueId

2015-02-20 Thread Dima Zhiyanov
Hello

A question regarding the new DataFrame API introduced here:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

I often use the zipWithUniqueId method of SchemaRDD (as an RDD) to replace
string keys with more efficient long keys. Would it be possible to use the
same method on the new DataFrame class?

It looks like, unlike SchemaRDD, DataFrame does not extend RDD.

Thanks
Dima
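
A possible workaround sketch, assuming Spark 1.3-style APIs (the rdd
accessor on DataFrame, Row.fromSeq, and createDataFrame(rowRDD, schema));
the "uid" column name is illustrative:

    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Zip the DataFrame's underlying RDD[Row] with unique ids and rebuild
    // a DataFrame that carries the extra long column.
    def withUniqueId(sqlContext: SQLContext, df: DataFrame): DataFrame = {
      val rows = df.rdd.zipWithUniqueId().map { case (row, id) =>
        Row.fromSeq(row.toSeq :+ id)
      }
      val schema = StructType(
        df.schema.fields :+ StructField("uid", LongType, nullable = false))
      sqlContext.createDataFrame(rows, schema)
    }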







Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Dima Zhiyanov
Hello 

Has Spark implemented computing statistics for Parquet files? Or is there
any other way I can enable broadcast joins between Parquet-file RDDs in
Spark SQL?

Thanks 
Dima
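
For context: Spark SQL decides to broadcast one side of an equi-join when
its estimated size falls below spark.sql.autoBroadcastJoinThreshold. A
sketch of raising that threshold (the 50 MB value and the table names are
illustrative; whether it fires for a Parquet table depends on the size
statistics the relation reports):

    // Broadcast any relation the planner estimates at under ~50 MB
    // (the default threshold was 10 MB).
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (50 * 1024 * 1024).toString)

    val joined = sqlContext.sql(
      "SELECT f.*, d.label FROM facts f JOIN dims d ON f.key = d.key")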



Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Dima Zhiyanov
Thank you!

The Hive solution seemed more like a workaround. I was wondering whether
native Spark SQL support for computing statistics for Parquet files would
be available.
Dima



Sent from my iPhone

 On Feb 11, 2015, at 3:34 PM, Ted Yu yuzhih...@gmail.com wrote:
 
 See earlier thread:
 http://search-hadoop.com/m/JW1q5BZhf92
 
 On Wed, Feb 11, 2015 at 3:04 PM, Dima Zhiyanov dimazhiya...@gmail.com 
 wrote:
 Hello
 
 Has Spark implemented computing statistics for Parquet files? Or is there
 any other way I can enable broadcast joins between Parquet-file RDDs in
 Spark SQL?
 
 Thanks
 Dima
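
For reference, a sketch of the Hive-metastore workaround referred to above,
assuming a HiveContext and a table registered in the metastore (the table
name small_dim is illustrative). The planner compares the statistics this
command populates against spark.sql.autoBroadcastJoinThreshold when deciding
whether a broadcast join is safe:

    // Populate table-size statistics in the Hive metastore without
    // scanning the data; a table that falls under the threshold can then
    // be picked up for a broadcast join.
    hiveContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")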
 
 
 
 


Re: spark sql left join gives KryoException: Buffer overflow

2014-08-05 Thread Dima Zhiyanov
I am also experiencing this Kryo buffer problem. My join is a left outer
join with under 40 MB on the right side, so I would expect the broadcast
join to succeed in this case (Hive did).
Another problem is that the optimizer chose a nested loop join for some
reason; I would expect a broadcast (map-side) hash join.
Am I correct in my expectations?
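
As an aside, the usual mitigation for "KryoException: Buffer overflow" is
to raise Kryo's maximum serialization buffer. A sketch, with the caveat
that the property names changed across releases (".mb"-suffixed in the 1.x
era, sizes with units later) and the value here is illustrative:

    import org.apache.spark.SparkConf

    // Use Kryo, and allow its per-record serialization buffer to grow
    // larger before "Buffer overflow" is thrown.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max.mb", "256") // 1.x-era name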







Re: spark sql left join gives KryoException: Buffer overflow

2014-08-05 Thread Dima Zhiyanov
Yes

Sent from my iPhone




