performance when checking if data frame is empty or not

2015-09-08 Thread Axel Dahl
I have a join that fails when one of the data frames is empty. To avoid this, I am hoping to check whether the dataframe is empty before the join. The question is: what's the most performant way to do that? Should I do df.count() or df.first() or something else? Thanks in advance, -Axel
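The replies are not shown in this digest, but as a rough guide: df.count() scans every partition, while checks that only need to find a single row can stop early. A minimal PySpark sketch, assuming a DataFrame named df:

    # take(1) fetches at most one row, so it avoids a full scan.
    is_empty = len(df.take(1)) == 0

    # rdd.isEmpty() likewise only evaluates partitions until it finds a row
    # (available since Spark 1.3).
    is_empty = df.rdd.isEmpty()

    # df.count(), by contrast, touches every partition and is usually the
    # slowest way to answer "is this DataFrame empty?".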

Re: spark-submit not using conf/spark-defaults.conf

2015-09-03 Thread Axel Dahl
logged it here: https://issues.apache.org/jira/browse/SPARK-10436 On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu <dav...@databricks.com> wrote: > I think it's a missing feature. > > On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl <a...@whisperstream.com> wrote: > > So a bi

spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
in my spark-defaults.conf I have:
spark.files file1.zip, file2.py
spark.master spark://master.domain.com:7077
If I execute bin/pyspark, I can see it adding the files correctly. However, if I execute bin/spark-submit test.py, where test.py relies on file1.zip, I get
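One hedged workaround sketch (not from the thread itself): pass the files explicitly on the spark-submit command line, which spark-submit does honor even when the spark.files setting in spark-defaults.conf is ignored. The paths below are illustrative.

    bin/spark-submit \
      --master spark://master.domain.com:7077 \
      --py-files file2.py \
      --files file1.zip \
      test.py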

Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
wrote: > This should be a bug, could you create a JIRA for it? > > On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com> wrote: > > in my spark-defaults.conf I have: > > spark.files file1.zip, file2.py > > spark.master spark:/

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
what you want on the other hand, if your application needs all cores of your cluster and only some specific job should run on a single executor, there are a few methods to achieve this, e.g. coalesce(1) or dummyRddWithOnePartitionOnly.foreachPartition On 18 August 2015 at 01:36, Axel Dahl
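A minimal sketch of the two approaches mentioned above, assuming a SparkContext sc, an existing RDD rdd, and hypothetical functions do_work / do_work_once:

    # 1) Collapse an existing RDD to one partition so the downstream stage
    #    runs as a single task on a single executor.
    result = rdd.coalesce(1).mapPartitions(do_work).collect()

    # 2) Use a dummy one-partition RDD to run a side effect exactly once on
    #    a single executor.
    sc.parallelize([0], numSlices=1).foreachPartition(lambda _: do_work_once())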

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
, but only executes everything on 1 node, it looks like it's not grabbing the extra nodes. On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl a...@whisperstream.com wrote: That worked great, thanks Andrew. On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or and...@databricks.com wrote: Hi Axel, You can try

how do I execute a job on a single worker node in standalone mode

2015-08-17 Thread Axel Dahl
I have a 4 node cluster and have been playing around with the num-executors, executor-memory and executor-cores parameters. I set the following: --executor-memory=10G --num-executors=1 --executor-cores=8 But when I run the job, I see that each worker is running one executor which has 2 cores
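For context, a hedged note: in Spark's standalone mode at the time, --num-executors was a YARN-only option; standalone deployments were usually capped with --total-executor-cores instead, alongside --executor-cores. A sketch of a submit command under that assumption (script name and master URL are illustrative):

    bin/spark-submit \
      --master spark://master.domain.com:7077 \
      --executor-memory 10G \
      --executor-cores 8 \
      --total-executor-cores 8 \
      my_job.py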

is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from rdds to dataframes it looks like the rdd is being materialized/collected/repartitioned before it's converted to a dataframe. Just wondering if there are any guidelines for doing this conversion and whether it's best to do it early to get the performance benefits of
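The replies are not shown here, but one detail worth knowing: when createDataFrame (or rdd.toDF()) is called without a schema, PySpark inspects the RDD's data to infer column types, which launches work before the DataFrame exists. A sketch of avoiding that by passing an explicit schema, assuming an existing sqlContext and an RDD of rows with hypothetical name/age fields:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Without a schema, createDataFrame reads data from the RDD to infer types.
    df_inferred = sqlContext.createDataFrame(rdd)

    # Supplying the schema explicitly skips that inference pass.
    schema = StructType([
        StructField('name', StringType(), True),
        StructField('age', IntegerType(), True),
    ])
    df = sqlContext.createDataFrame(rdd, schema)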

Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
still feels like a bug to have to create unique names before a join. On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote: You can declare the schema with unique names before creation of df. On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote: I have the following
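A minimal sketch of the workaround being discussed: rename the overlapping columns in one DataFrame before the join so every output column name is unique. The df1/df2 names and columns follow the example in the original post below.

    df2_renamed = (df2
                   .withColumnRenamed('name', 'name2')
                   .withColumnRenamed('country', 'country2'))

    joined = df1.join(df2_renamed, df1['name'] == df2_renamed['name2'], 'left_outer')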

Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
1.4? Nick On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl a...@whisperstream.com wrote: still feels like a bug to have to create unique names before a join. On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com

Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
was specific to 1.4. On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl a...@whisperstream.com wrote: I've only tested on 1.4, but I imagine 1.3 is the same or a lot of people's code would be failing right now. On Saturday, June 27, 2015, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yeah, you shouldn't

dataframe left joins are not working as expected in pyspark

2015-06-26 Thread Axel Dahl
I have the following code:
from pyspark import SQLContext
d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice', 'country': 'ire',
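The preview above is cut off, so here is a hedged, self-contained reconstruction of the same kind of example; the values in the truncated part of d2 are illustrative, not from the original post:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName='left-join-example')
    sqlContext = SQLContext(sc)

    d1 = [{'name': 'bob',   'country': 'usa', 'age': 1},
          {'name': 'alice', 'country': 'jpn', 'age': 2},
          {'name': 'carol', 'country': 'ire', 'age': 3}]
    d2 = [{'name': 'bob',   'country': 'usa', 'colour': 'red'},
          {'name': 'alice', 'country': 'ire', 'colour': 'green'}]

    df1 = sqlContext.createDataFrame(d1)
    df2 = sqlContext.createDataFrame(d2)

    # Both inputs carry a 'country' column, which is where the ambiguity
    # discussed in this thread comes from.
    joined = df1.join(df2, df1['name'] == df2['name'], 'left_outer')
    joined.show()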