RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Haopu Wang
Liquan, yes, for a full outer join, building a hash table on both sides is more efficient. For a left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 18:34 To: Haopu Wang Cc:

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
It looks like https://issues.apache.org/jira/browse/SPARK-1800 has not been merged into master? I cannot find spark.sql.hints.broadcastTables in the latest master, but it's in the following patch. https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5 Jianshi On Mon, Sep 29,

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang
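
The optimization discussed in this thread (building a hash table on only one side for a left/right outer join) can be illustrated with a minimal sketch in plain Python. This is not Spark's HashOuterJoin or HashJoin code; the function name `left_outer_hash_join` and its parameters are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def left_outer_hash_join(left, right, key_left, key_right):
    """Left outer join that hashes only the right side and streams the left.

    Hypothetical helper for illustration; not part of Spark SQL.
    """
    # Build phase: one hash table, keyed on the join key of the right side.
    table = defaultdict(list)
    for row in right:
        table[key_right(row)].append(row)

    # Probe phase: stream the left side; unmatched left rows pair with None.
    for l in left:
        matches = table.get(key_left(l))
        if matches:
            for r in matches:
                yield (l, r)
        else:
            yield (l, None)
```

A full outer join additionally has to emit unmatched rows from the hashed side, which is why it needs more bookkeeping than this single-table sketch.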

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
I am working on a PR to leverage the HashJoin trait code to optimize the left/right outer join. I've already tested it locally and will send out the PR soon after some cleanup. Thanks, Liquan On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia matei.zaha...@gmail.com wrote: I'm pretty sure inner

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
OK, currently there's cost-based optimization; however, Parquet statistics are not implemented... What's a good way to join a big fact table with several tiny dimension tables in Spark SQL (1.1)? I wish we could allow user hints for the join. Jianshi On Wed, Oct 8, 2014 at 2:18 PM,

Standardized Distance Functions in MLlib

2014-10-08 Thread Yu Ishikawa
Hi all, In my limited understanding of MLlib, it would be a good idea to support various distance functions in some machine learning algorithms. For example, we can only use the Euclidean distance metric in KMeans. I am also working on contributing hierarchical clustering to MLlib
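
As a rough illustration of the idea, the sketch below passes the distance function as a parameter to a nearest-center assignment step, which is the part of a KMeans-style algorithm that depends on the metric. It is plain Python, not MLlib or breeze code, and the names `euclidean`, `manhattan`, and `nearest_center` are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_center(point, centers, distance=euclidean):
    """Return the index of the center closest to `point` under `distance`.

    Hypothetical helper: the metric is a plain function argument, so the
    assignment step does not hard-code Euclidean distance.
    """
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))
```

With centers `[(5, 0), (3, 3)]` and the point `(0, 0)`, the two metrics pick different centers, which shows why the choice of distance function matters for the resulting clusters.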

Re: Unneeded branches/tags

2014-10-08 Thread Nicholas Chammas
So: - tags: can delete - branches: stuck with ‘em Correct? Nick On Wed, Oct 8, 2014 at 1:52 AM, Patrick Wendell pwend...@gmail.com wrote: Actually - weirdly - we can delete old tags and it works with the mirroring. Nick if you put together a list of un-needed tags I can delete

Re: Spark on Mesos 0.20

2014-10-08 Thread RJ Nowling
Yep! That's the example I was talking about. Is an error message printed when it hangs? I get: 14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140930-131734-1723727882-5050-1895-1 On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi

Re: Extending Scala style checks

2014-10-08 Thread Nicholas Chammas
I've created SPARK-3849: Automate remaining Scala style rules https://issues.apache.org/jira/browse/SPARK-3849. Please create sub-tasks on this issue for rules that we have not automated and let's work through them as possible. I went ahead and created the first sub-task, SPARK-3850: Scala

will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread James Yu
I didn't see anyone ask this question before, but I was wondering if anyone knows whether Spark/SparkSQL will support the ORCFile format soon? ORCFile is getting more and more popular in the Hive world. Thanks, James

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Evan Chan
James, Michael at the meetup last night said there was some development activity around ORCFiles. I'm curious though, what are the pros and cons of ORCFiles vs Parquet? On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote: Didn't see anyone asked the question before, but I was

Re: Standardized Distance Functions in MLlib

2014-10-08 Thread Xiangrui Meng
Hi Yu, We upgraded breeze to 0.10 yesterday, so we can easily call the distance functions you contributed to breeze. To keep the maintenance cost low, we don't want to maintain another copy of the implementation in MLlib. Both Spark and breeze are open-source projects. We should try our best to

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Michael Armbrust
Thanks for the input. We purposefully made sure that the config option did not make it into a release, as it is not something that we are willing to support long term. That said, we'll try to make this easier in the future, either through hints or better support for statistics. In this particular