Liquan, yes, for a full outer join, building a hash table on each side is more
efficient.
For a left/right outer join, it looks like one hash table should be enough.
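
To make the single-hash-table point concrete, here is a minimal sketch in plain Python. It is not Spark's implementation; the function name and the dict-based row layout are assumptions chosen for illustration. Only the right side is hashed, and the left side is streamed past it.

```python
from collections import defaultdict


def left_outer_hash_join(left_rows, right_rows, key):
    """Left outer join that hashes only the right side and streams the left.

    Rows are plain dicts; `key` is the join column present on both sides.
    (Hypothetical helper for illustration, not a Spark API.)
    """
    # Build phase: a single hash table keyed on the right side's join column.
    table = defaultdict(list)
    for row in right_rows:
        table[row[key]].append(row)

    # Probe phase: every left row is emitted at least once.
    for left in left_rows:
        matches = table.get(left[key])  # .get() does not create empty entries
        if matches:
            for right in matches:
                yield (left, right)
        else:
            # No match on the right: pad with None to keep the left row.
            yield (left, None)
```

A right outer join is the mirror image (hash the left, stream the right). A full outer join additionally has to emit unmatched rows from the hashed side, so it needs more bookkeeping than this sketch shows.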
From: Liquan Pei [mailto:liquan...@gmail.com]
Sent: September 30, 2014 18:34
To: Haopu Wang
Cc:
Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
into master?
I cannot find spark.sql.hints.broadcastTables in the latest master, but it's in
the following patch.
https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
Jianshi
On Mon, Sep 29,
I'm pretty sure inner joins in Spark SQL already build only one of the sides.
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer
joins build both, and it seems like we could optimize that for the ones that
are not full.
Matei
On Oct 7, 2014, at 11:04 PM, Haopu Wang
I am working on a PR to leverage the HashJoin trait code to optimize the
left/right outer join. It has already been tested locally, and I will send out
the PR soon after some cleanup.
Thanks,
Liquan
On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I'm pretty sure inner
OK, currently there's cost-based optimization; however, Parquet statistics are
not implemented...
What's a good way to join a big fact table with several tiny
dimension tables in Spark SQL (1.1)?
I wish we could allow user hints for the join.
Jianshi
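
One common approach for this shape of query is a map-side (broadcast) join: keep each tiny dimension table in memory as a lookup and stream the fact table past them, so the large side is never shuffled. The sketch below is plain Python and is not tied to any Spark SQL API; the function name, the `dimensions` layout, and the column names in the usage note are all hypothetical.

```python
def map_side_join(fact_rows, dimensions):
    """Join a large fact stream against small in-memory dimension tables.

    `dimensions` maps a foreign-key column name to a dict of
    key -> attribute dict. (Hypothetical layout for illustration.)
    """
    for fact in fact_rows:
        joined = dict(fact)  # copy so the input row is not mutated
        for fk_column, lookup in dimensions.items():
            attrs = lookup.get(fact[fk_column])
            if attrs is None:
                # Inner-join semantics: drop facts with a missing dimension.
                break
            joined.update(attrs)
        else:
            # Reached only when every dimension lookup succeeded.
            yield joined
```

For example, with `dimensions = {"store_id": stores, "item_id": items}`, each fact row is enriched with the matching store and item attributes in a single pass over the fact table.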
On Wed, Oct 8, 2014 at 2:18 PM,
Hi all,
In my limited understanding of MLlib, it would be a good idea to support
various distance functions in some of the machine learning algorithms. For
example, KMeans can currently use only the Euclidean distance metric. I am also
working on contributing hierarchical clustering to MLlib
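
As a rough illustration of what a pluggable distance function could look like, here is a minimal sketch in plain Python. It is not MLlib's or Breeze's API; all names are hypothetical. Only the cluster-assignment step is shown, and the center-update step may also need to change for non-Euclidean metrics.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_clusters(points, centers, distance=euclidean):
    """Return, for each point, the index of its nearest center under `distance`."""
    return [
        min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        for p in points
    ]
```

The choice of metric can change the assignment: for the point `(0, 0)` and centers `(3, 3)` and `(0, 5)`, Euclidean distance picks the first center while Manhattan distance picks the second.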
So:
- tags: can delete
- branches: stuck with ‘em
Correct?
Nick
On Wed, Oct 8, 2014 at 1:52 AM, Patrick Wendell pwend...@gmail.com wrote:
Actually - weirdly - we can delete old tags and it works with the
mirroring. Nick if you put together a list of un-needed tags I can
delete
Yep! That's the example I was talking about.
Is an error message printed when it hangs? I get:
14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140930-131734-1723727882-5050-1895-1
On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi
I've created SPARK-3849: Automate remaining Scala style rules
https://issues.apache.org/jira/browse/SPARK-3849.
Please create sub-tasks on this issue for rules that we have not yet automated,
and let's work through them as time allows.
I went ahead and created the first sub-task, SPARK-3850: Scala
I didn't see anyone ask this question before, but I was wondering if anyone
knows whether Spark/SparkSQL will support the ORCFile format soon? ORCFile is
getting more and more popular in the Hive world.
Thanks,
James
James,
Michael at the meetup last night said there was some development
activity around ORCFiles.
I'm curious though, what are the pros and cons of ORCFiles vs Parquet?
On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote:
Didn't see anyone asked the question before, but I was
Hi Yu,
We upgraded Breeze to 0.10 yesterday, so we can easily call the distance
functions you contributed to Breeze. To keep the maintenance cost low, we
don't want to maintain another copy of the implementation in MLlib.
Both Spark and Breeze are open-source projects. We should
try our best to
Thanks for the input. We purposefully made sure that the config option did
not make it into a release, as it is not something that we are willing to
support long term. That said, we'll try to make this easier in the future,
either through hints or better support for statistics.
In this particular