I have a join that fails when one of the data frames is empty.
To avoid this I'm hoping to check whether the dataframe is empty before
the join.
The question is: what's the most performant way to do that?
Should I use df.count(), df.first(), or something else?
Thanks in advance,
-Axel
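For what it's worth, the usual cheap pattern in PySpark is df.take(1) (or limit(1).count()) rather than df.count(), since count() has to scan every partition. The same idea in plain Python, sketched with generators standing in for a large distributed dataset (illustrative names only, not Spark API):

```python
from itertools import islice

def is_empty_by_count(rows):
    # Analogous to df.count() == 0: consumes the entire iterator.
    return sum(1 for _ in rows) == 0

def is_empty_by_first(rows):
    # Analogous to df.take(1): pulls at most one element, then stops.
    return len(list(islice(rows, 1))) == 0

# A lazily generated "dataset" of a million rows.
big = (i for i in range(1_000_000))
print(is_empty_by_first(big))    # False, after reading only one row

empty = (i for i in range(0))
print(is_empty_by_first(empty))  # True
```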
logged it here:
https://issues.apache.org/jira/browse/SPARK-10436
On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu <dav...@databricks.com> wrote:
> I think it's a missing feature.
>
> On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl <a...@whisperstream.com> wrote:
> > So a bi
in my spark-defaults.conf I have:
spark.files file1.zip, file2.py
spark.master spark://master.domain.com:7077
If I execute:
bin/pyspark
I can see it adding the files correctly.
However, if I execute
bin/spark-submit test.py
where test.py relies on file1.zip, I get
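One workaround sketch while the JIRA is open (assumption: the files are only being picked up by the pyspark shell, not by spark-submit): pass the dependencies explicitly on the command line instead of relying on spark-defaults.conf. It may also be worth trying the list without the space after the comma, since spark.files is parsed as a comma-separated list.

```shell
# Hypothetical workaround: supply the Python dependencies per submit.
# --py-files takes a comma-separated list (no spaces) of .zip/.egg/.py files.
bin/spark-submit --py-files file1.zip,file2.py test.py
```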
> This should be a bug, could you create a JIRA for it?
>
> On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com> wrote:
> > in my spark-defaults.conf I have:
> > spark.files file1.zip, file2.py
> > spark.master spark:/
what you want
on the other hand, if your application needs all cores of your cluster and
only some specific job should run on a single executor, there are a few
methods to achieve this,
e.g. coalesce(1) or dummyRddWithOnePartitionOnly.foreachPartition
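To illustrate the coalesce(1) idea outside Spark: collapsing many partitions into one means the per-partition function runs exactly once, over all the data. A plain-Python sketch (illustrative names, not Spark API):

```python
# Simulate an RDD as a list of partitions.
partitions = [[1, 2], [3, 4], [5, 6, 7]]

def coalesce_to_one(parts):
    # Like rdd.coalesce(1): merge all partitions into a single one.
    merged = [row for part in parts for row in part]
    return [merged]

calls = []
def per_partition(part):
    # Stand-in for the function passed to foreachPartition.
    calls.append(sum(part))

for part in coalesce_to_one(partitions):
    per_partition(part)

print(calls)  # [28] -- ran exactly once, over all rows
```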
On 18 August 2015 at 01:36, Axel Dahl
, but only executes
everything on 1 node, it looks like it's not grabbing the extra nodes.
On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl a...@whisperstream.com wrote:
That worked great, thanks Andrew.
On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or and...@databricks.com wrote:
Hi Axel,
You can try
I have a 4 node cluster and have been playing around with the num-executors
parameters, executor-memory and executor-cores
I set the following:
--executor-memory=10G
--num-executors=1
--executor-cores=8
But when I run the job, I see that each worker is running one executor
with 2 cores.
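If this is the standalone master (spark://...), note that --num-executors is a YARN-only flag in this era of Spark; standalone mode launches one executor per worker and spreads the app's cores across workers by default. A hedged config sketch (these are real standalone properties, but whether this matches the fix Andrew actually suggested is an assumption):

```
# spark-defaults.conf sketch for standalone mode (assumption, not the confirmed fix)
spark.cores.max        8       # cap the total cores the application may use
spark.executor.cores   8       # ask for one 8-core executor rather than several small ones
```

spark.deploy.spreadOut (a master-side setting, false to pack cores onto as few workers as possible) is the other knob that controls this behavior.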
In pyspark, when I convert from rdds to dataframes it looks like the rdd is
being materialized/collected/repartitioned before it's converted to a
dataframe.
Just wondering if there are any guidelines for doing this conversion and
whether it's best to do it early to get the performance benefits of
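The materialization is likely schema inference: without an explicit schema, the rdd-to-dataframe conversion has to scan (a sample of) the RDD to deduce column types, while passing a schema yourself avoids that pass. A plain-Python sketch of the difference (illustrative code, not the Spark implementation):

```python
rows = [{'name': 'bob', 'age': 1}, {'name': 'alice', 'age': 2}]

def infer_schema(data):
    # Stand-in for schema inference: must touch the rows to learn the types.
    schema = {}
    for row in data:
        for key, value in row.items():
            schema[key] = type(value).__name__
    return schema

# Inferred: requires a pass over the data.
print(infer_schema(rows))  # {'name': 'str', 'age': 'int'}

# Explicit: no pass over the data needed, analogous to
# sqlContext.createDataFrame(rdd, schema) with a schema you provide.
explicit_schema = {'name': 'str', 'age': 'int'}
```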
still feels like a bug to have to create unique names before a join.
On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:
You can declare the schema with unique names before creation of df.
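ayan's suggestion as a runnable sketch, done here with plain dicts rather than DataFrames (illustrative only; in PySpark the equivalent is building the schema with unique names, or renaming the clashing columns before the join):

```python
def rename_keys(rows, prefix, join_key):
    # Prefix every column except the join key so names are unique after the join.
    return [{(k if k == join_key else prefix + k): v for k, v in row.items()}
            for row in rows]

left  = [{'name': 'bob', 'country': 'usa'}]
right = [{'name': 'bob', 'country': 'ire'}]

# Without renaming, both sides would contribute an ambiguous 'country' column.
r = rename_keys(right, 'r_', 'name')
joined = [{**l, **rr} for l in left for rr in r if l['name'] == rr['name']]
print(joined)  # [{'name': 'bob', 'country': 'usa', 'r_country': 'ire'}]
```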
On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote:
I have the following
1.4?
Nick
On June 27, 2015 (Sat) at 2:51 AM, Axel Dahl a...@whisperstream.com wrote:
still feels like a bug to have to create unique names before a join.
On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com
wrote:
was
specific to 1.4.
On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl a...@whisperstream.com
wrote:
I've only tested on 1.4, but imagine 1.3 is the same or a lot of
people's code would be failing right now.
On Saturday, June 27, 2015, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Yeah, you shouldn't
I have the following code:
from pyspark.sql import SQLContext
d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
      {'name': 'alice', 'country': 'jpn', 'age': 2},
      {'name': 'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
      {'name': 'alice', 'country': 'ire',