Re: Spark stages very slow to complete

2015-06-02 Thread Karlson
Hi, the code is a few hundred lines of Python. I can try to compose a minimal example as soon as I find the time, though. Any ideas until then? Would you mind posting the code? On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote: Hi, In all (pyspark) Spark jobs, that become somewhat

Spark stages very slow to complete

2015-06-01 Thread Karlson
Hi, in all (pyspark) Spark jobs that become somewhat more involved, I am experiencing the issue that some stages take a very long time to complete and sometimes don't complete at all. This clearly correlates with the size of my input data. Looking at the stage details for one such stage, I am

Re: [pyspark] Starting workers in a virtualenv

2015-05-22 Thread Karlson
That works, thank you! On 2015-05-22 03:15, Davies Liu wrote: Could you try setting PYSPARK_PYTHON to the path of python in your virtual env, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote
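The suggestion quoted above amounts to pointing PYSPARK_PYTHON at the interpreter inside the virtualenv before launching spark-submit. A minimal sketch of building such an environment from Python follows; the helper name build_submit_env is hypothetical, /path/to/env is the placeholder path from the message, and the bin/python layout assumes a POSIX-style virtualenv.

```python
import os


def build_submit_env(venv_dir, base_env=None):
    """Return a copy of the environment with PYSPARK_PYTHON set.

    Hypothetical helper: assumes the virtualenv keeps its interpreter
    at <venv_dir>/bin/python (POSIX layout).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = os.path.join(venv_dir, "bin", "python")
    return env


# Placeholder path taken from the message above, not a real location.
env = build_submit_env("/path/to/env", base_env={})

# The resulting mapping could then be passed to a launcher, for example:
#   subprocess.call(["bin/spark-submit", "xx.py"], env=env)
# The call is left commented out because it needs a Spark installation.
```

This only covers the launching side. Whether the same interpreter path exists on every worker node is a separate assumption that has to hold for the cluster in question.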

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Alright, that doesn't seem to have made it into the Python API yet. On 2015-05-22 15:12, Silvio Fiorito wrote: This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
, ayan guha wrote: DataFrame is an abstraction of rdd. So you should be able to do df.rdd.partitionBy. However, as far as I know, equijoins already optimize partitioning. You may want to look at explain plans more carefully and materialise interim joins. On 22 May 2015 19:03, Karlson ksonsp

Re: Join on DataFrames from the same source (Pyspark)

2015-04-22 Thread Karlson
add an alias as follows: from pyspark.sql.functions import * df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1")) On Tue, Apr 21, 2015 at 8:10 AM, Karlson ksonsp...@siberie.de wrote: Sorry, my code actually was df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
= df.select('col1', 'col2') df_two = df.select('col1', 'col3') Your current code is generating a tuple, and of course df_1 and df_2 are different, so the join yields a cartesian product. Best Ayan On Wed, Apr 22, 2015 at 12:42 AM, Karlson ksonsp...@siberie.de wrote: Hi, can anyone confirm

[pyspark] Starting workers in a virtualenv

2015-04-20 Thread Karlson
Hi all, I am running the Python process that communicates with Spark in a virtualenv. Is there any way I can make sure that the Python processes of the workers are also started in a virtualenv? Currently I am getting ImportErrors when the worker tries to unpickle stuff that is not installed

Save and read parquet from the same path

2015-03-04 Thread Karlson
Hi all, what would happen if I save an RDD via saveAsParquetFile to the same path that the RDD was originally read from? Is that a safe thing to do in Pyspark? Thanks! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
,org.apache.spark.OneToOneDependency@7bc172ec) (d3 ShuffledRDD[12] at groupByKey at console:12,org.apache.spark.ShuffleDependency@d794984) (MappedRDD[11] at map at console:12,org.apache.spark.OneToOneDependency@15c98005) On Thu, Feb 12, 2015 at 10:05 AM, Karlson ksonsp...@siberie.de wrote: Hi Imran

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
)) ? As I understand, this would preserve the original partitioning. On 2015-02-13 12:43, Karlson wrote: Does that mean partitioning does not work in Python? Or does this only affect joining? On 2015-02-12 19:27, Davies Liu wrote: The feature works as expected in Scala/Java, but not implemented

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs. On 2015-02-12 17:12, Sean Owen wrote: Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote: Hi Karlson, I think your
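The question of whether both RDDs need the same partitioner can be illustrated without Spark. The sketch below is plain Python, not Spark's implementation: the functions partition_by and join_copartitioned are hypothetical names. It shows that when two keyed datasets are split with the same partition function and the same number of partitions, equal keys land in the same bucket index, so each bucket pair can be joined locally without moving records between buckets.

```python
from collections import defaultdict


def partition_by(pairs, num_partitions, partition_func=hash):
    """Split (key, value) pairs into buckets by partition_func(key) % n.

    Illustrative only; loosely modelled on the idea behind partitionBy.
    """
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets


def join_copartitioned(left_buckets, right_buckets):
    """Join bucket i of the left side with bucket i of the right side only."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides must use the same number of partitions")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Index the right bucket by key, then probe it with the left bucket.
        index = defaultdict(list)
        for key, value in right:
            index[key].append(value)
        for key, value in left:
            for other in index.get(key, []):
                joined.append((key, (value, other)))
    return joined


left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(1, "x"), (3, "y"), (3, "z"), (5, "w")]

# Same partition function and same partition count on both sides.
joined = join_copartitioned(partition_by(left, 3), partition_by(right, 3))
```

If the two sides used different partition functions or different partition counts, matching keys could sit in different bucket indices, and a bucket-by-bucket join would miss them unless the data were redistributed first.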

Shuffle on joining two RDDs

2015-02-12 Thread Karlson
. What's the explanation for that behaviour? Where am I wrong with my assumption? Thanks in advance, Karlson