Re: Support for Hive buckets

2014-12-24 Thread tanejagagan
(for example, you might be able to avoid a shuffle when doing joins on tables that are already bucketed by exposing more metastore information to the planner). Can you provide more input on how to implement this functionality so that I can speed up a join between two Hive tables, both with a few bill
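
For context, a minimal sketch of the setup being described, assuming a SparkContext sc and a HiveContext, with hypothetical table names. As of Spark 1.2 the planner does not use bucketing metadata, so the join below still shuffles both sides; the proposal is to expose that metadata so the shuffle can be skipped:

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // Two large tables bucketed on the join key (hypothetical names).
    hc.sql("""CREATE TABLE events (user_id BIGINT, payload STRING)
              CLUSTERED BY (user_id) INTO 256 BUCKETS""")
    hc.sql("""CREATE TABLE users (user_id BIGINT, name STRING)
              CLUSTERED BY (user_id) INTO 256 BUCKETS""")

    // Today this join shuffles both tables, even though the bucketing
    // already co-locates matching user_id values.
    hc.sql("SELECT e.payload, u.name FROM events e JOIN users u ON e.user_id = u.user_id")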

Re: cleaning up cache files left by SPARK-2713

2014-12-24 Thread Josh Rosen
I reviewed and merged that PR, in case you want to try out the fix. - Josh On December 22, 2014 at 10:40:35 AM, Marcelo Vanzin (van...@cloudera.com) wrote: https://github.com/apache/spark/pull/3705 On Mon, Dec 22, 2014 at 10:19 AM, Cody Koeninger wrote: > Is there a reason not to go ahead an

Re: Confirming race condition in DagScheduler (NoSuchElementException)

2014-12-24 Thread Josh Rosen
I’m investigating this issue and left some comments on the proposed fix:  https://github.com/apache/spark/pull/3345#issuecomment-68014353 To summarize, I agree with your description of the problem but think that the right fix may be a bit more involved than what’s proposed in that PR (that PR’s

Starting with Spark

2014-12-24 Thread Naveen Madhire
Hi All, I am starting to use Spark. I am having trouble getting the latest code from git. I am using IntelliJ as suggested in the link below, https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-StarterTasks The link below isn't working either, http://sp

Re: Starting with Spark

2014-12-24 Thread Timothy Chen
What error are you getting? Tim Sent from my iPhone > On Dec 24, 2014, at 8:59 PM, Naveen Madhire wrote: > > Hi All, > > I am starting to use Spark. I am having trouble getting the latest code > from git. > I am using IntelliJ as suggested in the link below, > > https://cwiki.apache.org/conf

Problems with large dataset using collect() and broadcast()

2014-12-24 Thread Will Yang
Hi all, In my case, I have a huge HashMap[(Int, Long), (Double, Double, Double)], say several GB to tens of GB; after each iteration I need to collect() this HashMap, perform some calculation, and then broadcast() it to every node. Now I have 20GB for each executor and after it performs
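
A minimal sketch of the iterate → collect() → broadcast() pattern being described, assuming a SparkContext sc; data, numIterations, computePartial, and mergeMaps are hypothetical. One common source of memory growth in this pattern is never releasing the previous iteration's broadcast:

    import org.apache.spark.broadcast.Broadcast

    var model: Broadcast[Map[(Int, Long), (Double, Double, Double)]] =
      sc.broadcast(Map.empty[(Int, Long), (Double, Double, Double)])

    for (i <- 1 to numIterations) {
      // Build one partial map per partition, then collect() them all to
      // the driver; the driver JVM must hold every partial at once.
      val partials = data
        .mapPartitions(it => Iterator(computePartial(it, model.value)))
        .collect()
      val merged = partials.reduce(mergeMaps)

      // Drop the previous broadcast's blocks before creating the next
      // one, otherwise every iteration caches another multi-GB copy.
      model.unpersist(blocking = true)
      model = sc.broadcast(merged)
    }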

Re: Problems with large dataset using collect() and broadcast()

2014-12-24 Thread Patrick Wendell
Hi Will, When you call collect() the item you are collecting needs to fit in memory on the driver. Is it possible your driver program does not have enough memory? - Patrick On Wed, Dec 24, 2014 at 9:34 PM, Will Yang wrote: > Hi all, > In my case, I have a huge HashMap[(Int, Long), (Double,
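
For reference, a sketch of the settings that usually matter here (not from the thread; the 24g value is illustrative):

    import org.apache.spark.SparkConf

    // spark.driver.memory must be set before the driver JVM starts, so
    // it is normally passed to spark-submit rather than set in code:
    //
    //   spark-submit --driver-memory 24g --class com.example.App app.jar
    //
    // Spark 1.2 also caps the total size of collect() results; raising
    // (or disabling) that cap is done on the SparkConf:
    val conf = new SparkConf()
      .setAppName("big-collect")
      .set("spark.driver.maxResultSize", "0")  // 0 = unlimited (default 1g)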

Re: Starting with Spark

2014-12-24 Thread Nicholas Chammas
The correct docs link is: https://spark.apache.org/docs/1.2.0/building-spark.html Where did you get that bad link from? Nick On Thu Dec 25 2014 at 12:00:53 AM Naveen Madhire wrote: > Hi All, > > I am starting to use Spark. I am having trouble getting the latest code > from git. > I am using I

Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi, We have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to handle the check manually. There's a thread in the mailing list that discussed this (http://apach
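
The manual check alluded to here can be done with the Hadoop FileSystem API before saving; a sketch, assuming a SparkContext sc and an RDD rdd (the path is hypothetical). Note that delete-then-write is not atomic:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val out = new Path("hdfs:///user/jerry/output")
    val fs = out.getFileSystem(sc.hadoopConfiguration)

    // Remove any previous output so saveAsTextFile will not throw
    // "output directory already exists".
    if (fs.exists(out)) {
      fs.delete(out, true)  // recursive
    }
    rdd.saveAsTextFile(out.toString)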

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai wrote: > Hi, > > > > We have a requirement to save RDD output to HDFS with a saveAsTextFile-like > API, but need
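
A sketch of that setting in use (path and app name hypothetical); it has to be set before the SparkContext is created:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("overwrite-example")
      // Skip Hadoop's check that the output directory does not exist.
      .set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100)
    rdd.saveAsTextFile("hdfs:///user/jerry/output")  // no longer fails if the path exists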

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a friendlier API, rather than a configuration option, for this purpose. What do you think, Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: u...@spark.apache.org;

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that
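
To illustrate the hazard Patrick describes (a sketch, assuming validation has been disabled as above): re-saving into an existing directory only replaces part files whose names collide, so output written with fewer partitions silently mixes old and new data:

    // First run writes part-00000 .. part-00003.
    sc.parallelize(1 to 100, 4).saveAsTextFile("/tmp/out")

    // Second run writes only part-00000 and part-00001; part-00002 and
    // part-00003 from the first run are left behind in /tmp/out.
    sc.parallelize(1 to 100, 2).saveAsTextFile("/tmp/out")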

Re: Which committers care about Kafka?

2014-12-24 Thread Cody Koeninger
After a long talk with Patrick and TD (thanks guys), I opened the following jira: https://issues.apache.org/jira/browse/SPARK-4964 The sample PR has an implementation for both the batch and the dstream case, and a link to a project with example usage. On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers wrote:

Re: Which committers care about Kafka?

2014-12-24 Thread Hari Shreedharan
In general, such discussions happen on or are posted to the dev lists. Could you please post a summary? Thanks. Thanks, Hari On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger wrote: > After a long talk with Patrick and TD (thanks guys), I opened the following > jira > https://issues.apache.org/jir

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Thanks Patrick for your detailed explanation. BR Jerry -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:43 PM To: Cheng, Hao Cc: Shao, Saisai; u...@spark.apache.org; dev@spark.apache.org Subject: Re: Question on saveAsTextFile with