[jira] [Created] (SPARK-6072) Enable hash joins for nullable columns
Dima Zhiyanov created SPARK-6072:

Summary: Enable hash joins for nullable columns
Key: SPARK-6072
URL: https://issues.apache.org/jira/browse/SPARK-6072
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.1
Reporter: Dima Zhiyanov

Currently, joins such as A join B on A.x = B.x AND A.y = B.y are evaluated as a hash join on x alone, followed by a filter on y. This causes a skew problem (a very long-running join) when a particular value of x occurs with very high frequency, even though (x, y) is evenly distributed. Could we implement it as a hash join on (x, Option(y))? That would eliminate the skew in this case.

Imagine a join: People as p1 join People as p2 on p1.name = p2.name and p1.address = p2.address (a very small percentage of people have an unknown address). Hashing on names alone causes a skewed join on popular names such as Mary Brown, but hashing on (Name, Option(Address)) does not.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
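To make the skew concrete, here is a small Spark-free Python sketch (names and data invented for illustration) comparing partition assignment when hashing on the name alone versus on a (name, address) composite key:

```python
from collections import Counter

NUM_PARTITIONS = 4

# Invented sample data: one very popular name with many distinct addresses.
people = [("Mary Brown", "%d Main St" % i) for i in range(100)] + [
    ("Alex Smith", "1 Oak Ave"),
    ("Jo Lee", "2 Elm Rd"),
]

def partition_by_name(rows, n):
    """Hash-partition on name alone (the behavior the report describes)."""
    return Counter(hash(name) % n for name, _addr in rows)

def partition_by_composite(rows, n):
    """Hash-partition on (name, address); address could be None for unknown."""
    return Counter(hash((name, addr)) % n for name, addr in rows)

by_name = partition_by_name(people, NUM_PARTITIONS)
by_composite = partition_by_composite(people, NUM_PARTITIONS)

# Hashing on name alone sends all 100 "Mary Brown" rows to one partition;
# the composite key spreads them out.
print(max(by_name.values()), max(by_composite.values()))
```

The exact partition counts depend on Python's string-hash seed, but the largest partition under name-only hashing always holds at least the 100 "Mary Brown" rows, while the composite key distributes them.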
DataFrame: Enable zipWithUniqueId
Hello,

Question regarding the new DataFrame API introduced here:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

I often use the zipWithUniqueId method of SchemaRDD (as an RDD) to replace string keys with more efficient long keys. Would it be possible to use the same method on the new DataFrame class? It looks like, unlike SchemaRDD, DataFrame does not extend RDD.

Thanks,
Dima

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-Enable-zipWithUniqueId-tp21733.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
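For context on what zipWithUniqueId actually does: it needs no global count of elements. With n partitions, items in partition k receive ids k, n+k, 2n+k, and so on. A plain-Python sketch of that scheme (no Spark required; the sample data is invented):

```python
def zip_with_unique_id(partitions):
    """Mimic RDD.zipWithUniqueId: item j of partition k gets id k + j * n,
    where n is the number of partitions. Ids are unique, not consecutive."""
    n = len(partitions)
    return [
        [(item, k + j * n) for j, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]

# Two partitions of string keys we would like to replace with long ids.
parts = [["apple", "banana"], ["cherry"]]
result = zip_with_unique_id(parts)
print(result)  # [[('apple', 0), ('banana', 2)], [('cherry', 1)]]
```

In practice one workaround is to drop down to the underlying RDD (df.rdd.zipWithUniqueId()) and rebuild a DataFrame from the result, at the cost of leaving the DataFrame API.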
Re: How to do broadcast join in SparkSQL
Hello,

Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between Parquet file RDDs in Spark SQL?

Thanks,
Dima

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-SparkSQL-tp15298p21632.html
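The broadcast (map-side) join being asked about amounts to shipping the small side to every task and probing a hash map built from it, so the large side never shuffles. A Spark-free Python sketch of the idea (table contents invented):

```python
from collections import defaultdict

# Small dimension table (the side we would broadcast) and a large fact table.
small = [(1, "US"), (2, "UK")]
large = [("alice", 1), ("bob", 2), ("carol", 1), ("dave", 3)]

def broadcast_hash_join(large_rows, small_rows):
    """Inner join: build a hash map on the small side, stream the large side.
    Avoiding a shuffle of the large side is the whole point."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return [
        (name, key, country)
        for name, key in large_rows
        for country in lookup.get(key, [])
    ]

joined = broadcast_hash_join(large, small)
print(joined)  # [('alice', 1, 'US'), ('bob', 2, 'UK'), ('carol', 1, 'US')]
```

Spark SQL's planner only picks this strategy when it has a size estimate for the small side under spark.sql.autoBroadcastJoinThreshold, which is why the table statistics asked about above matter.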
Re: How to do broadcast join in SparkSQL
Thank you! The Hive solution seemed more like a workaround. I was wondering whether native Spark SQL support for computing statistics for Parquet files would be available.

Dima

Sent from my iPhone

On Feb 11, 2015, at 3:34 PM, Ted Yu yuzhih...@gmail.com wrote:

> See earlier thread: http://search-hadoop.com/m/JW1q5BZhf92
>
> On Wed, Feb 11, 2015 at 3:04 PM, Dima Zhiyanov dimazhiya...@gmail.com wrote:
>> Hello, Has Spark implemented computing statistics for Parquet files? Or is there
>> any other way I can enable broadcast joins between Parquet file RDDs in Spark SQL?
>> Thanks, Dima
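For reference, the Hive workaround alluded to above is to populate table statistics with ANALYZE TABLE, which Spark SQL can then read when sizing joins. A minimal sketch (the table name is hypothetical):

```sql
-- Record table-level statistics for a hypothetical Parquet-backed table;
-- NOSCAN captures file sizes without reading the data itself.
ANALYZE TABLE people COMPUTE STATISTICS NOSCAN;
```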
Re: spark sql left join gives KryoException: Buffer overflow
I am also experiencing this Kryo buffer problem. My join is a left outer join with under 40 MB on the right side. I would expect the broadcast join to succeed in this case (Hive did). Another problem is that the optimizer chose a nested loop join for some reason; I would expect a broadcast (map-side) hash join. Am I correct in my expectations?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-left-join-gives-KryoException-Buffer-overflow-tp10157p11432.html
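On the buffer overflow itself: the usual immediate workaround is to raise the Kryo buffer ceiling in the Spark configuration. In the Spark 1.x series the setting was spark.kryoserializer.buffer.max.mb; the value below is illustrative, not a recommendation:

```
# spark-defaults.conf (illustrative value)
spark.kryoserializer.buffer.max.mb   256
```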
Re: spark sql left join gives KryoException: Buffer overflow
Yes.

Sent from my iPhone

On Aug 5, 2014, at 7:38 AM, Dima Zhiyanov [via Apache Spark User List] ml-node+s1001560n11432...@n3.nabble.com wrote:

> I am also experiencing this Kryo buffer problem. My join is a left outer join with
> under 40 MB on the right side. I would expect the broadcast join to succeed in this
> case (Hive did). Another problem is that the optimizer chose a nested loop join for
> some reason; I would expect a broadcast (map-side) hash join. Am I correct in my
> expectations?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-left-join-gives-KryoException-Buffer-overflow-tp10157p11433.html