Hi, the code is a few hundred lines of Python. I can try to compose a
minimal example as soon as I find the time, though. Any ideas until
then?
Would you mind posting the code?
On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote:
Hi,
In all (PySpark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to
complete and sometimes don't finish at all. This clearly correlates with
the size of my input data. Looking at the stage details for one such
stage, I am [...]
That works, thank you!
On 2015-05-22 03:15, Davies Liu wrote:
Could you try specifying PYSPARK_PYTHON as the path to the Python
binary in your virtualenv? For example:
PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py
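To my knowledge the same thing can also be done from the driver script itself, since PySpark reads PYSPARK_PYTHON when the SparkContext is created. A minimal sketch, with a placeholder virtualenv path:
import os
# Point the workers at the virtualenv's interpreter before the
# SparkContext exists; /path/to/env is a hypothetical placeholder.
os.environ['PYSPARK_PYTHON'] = '/path/to/env/bin/python'

from pyspark import SparkContext

sc = SparkContext(appName='virtualenv-sketch')
print(sc.parallelize(range(10)).sum())  # executed by the virtualenv workers
sc.stop()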
On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote:
Alright, that doesn't seem to have made it into the Python API yet.
On 2015-05-22 15:12, Silvio Fiorito wrote:
This was added in 1.4.0:
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD [...]
ayan guha wrote:
A DataFrame is an abstraction over an RDD, so you should be able to do
df.rdd.partitionBy. However, as far as I know, equijoins already
optimize partitioning. You may want to look at the explain plans more
carefully and materialise interim joins.
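For illustration, a sketch (not from the thread) of what that could look like: keying the DataFrame's underlying RDD, repartitioning it, and inspecting a join's plan with explain(). The column names, data, and partition count are assumptions.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='df-partition-sketch')
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, 'a'), (2, 'b')], ['col1', 'col2'])

# Key the underlying RDD by the join column and repartition it,
# then rebuild a DataFrame from the partitioned rows.
keyed = df.rdd.map(lambda row: (row.col1, row)).partitionBy(8)
df2 = sqlContext.createDataFrame(keyed.values())

# Inspect the physical plan of the equijoin, as suggested above.
df.join(df2, df.col1 == df2.col1).explain()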
On 22 May 2015 19:03, Karlson ksonsp...@siberie.de wrote:
You can add an alias as follows:
from pyspark.sql.functions import *
df.alias('a').join(df.alias('b'), col('a.col1') == col('b.col1'))
On Tue, Apr 21, 2015 at 8:10 AM, Karlson ksonsp...@siberie.de wrote:
Sorry, my code actually was
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Your current code is generating a tuple, and of course df_one and
df_two are different, so the join is yielding a cartesian product.
Best
Ayan
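For reference, a sketch of the corrected self-join, combining the fixed select calls with the alias approach quoted earlier in the thread. The column names match the thread; the data is made up.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

sc = SparkContext(appName='self-join-sketch')
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, 'a', 'x'), (2, 'b', 'y')],
                                ['col1', 'col2', 'col3'])

df_one = df.select('col1', 'col2').alias('a')
df_two = df.select('col1', 'col3').alias('b')

# An explicit equijoin condition on the aliased frames avoids both the
# ambiguous self-join columns and the accidental cartesian product.
df_one.join(df_two, col('a.col1') == col('b.col1')).show()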
On Wed, Apr 22, 2015 at 12:42 AM, Karlson ksonsp...@siberie.de wrote:
Hi,
can anyone confirm [...]
Hi all,
I am running the Python process that communicates with Spark in a
virtualenv. Is there any way I can make sure that the Python processes
of the workers are also started in a virtualenv? Currently I am getting
ImportErrors when the worker tries to unpickle stuff that is not
installed.
Hi all,
what would happen if I save an RDD via saveAsParquetFile to the same
path that the RDD was originally read from? Is that a safe thing to do
in PySpark?
Thanks!
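The thread does not answer this, but since Spark evaluates lazily, writing to a path that is still the source of the read is risky: the output job may start replacing input files before they have been fully read. A cautious pattern (an assumption, not from the thread) is to write to a side path first. Sketch with hypothetical paths:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='parquet-sketch')
sqlContext = SQLContext(sc)

df = sqlContext.parquetFile('/data/events.parquet')  # hypothetical input path
result = df.filter(df.col1 > 0)                      # some transformation

# Write to a temporary sibling path instead of overwriting the input;
# replace the original only after the job has finished successfully
# (e.g. with hadoop fs -mv, outside of Spark).
result.saveAsParquetFile('/data/events.parquet.tmp')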
(..., org.apache.spark.OneToOneDependency@7bc172ec)
(d3 ShuffledRDD[12] at groupByKey at <console>:12, org.apache.spark.ShuffleDependency@d794984)
(MappedRDD[11] at map at <console>:12, org.apache.spark.OneToOneDependency@15c98005)
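In PySpark, a comparable view of the lineage is available through toDebugString(), which marks the shuffle that groupByKey introduces. A small sketch:
from pyspark import SparkContext

sc = SparkContext(appName='lineage-sketch')
pairs = sc.parallelize([(1, 'a'), (1, 'b'), (2, 'c')])
d3 = pairs.map(lambda kv: kv).groupByKey()  # map -> shuffle -> groupByKey

# The debug string prints the RDD lineage; the indentation marks the
# shuffle boundary corresponding to the ShuffleDependency above.
print(d3.toDebugString())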
On Thu, Feb 12, 2015 at 10:05 AM, Karlson ksonsp...@siberie.de
wrote:
Hi Imran,
[...]
As I understand, this would preserve the original partitioning.
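A quick way to check that assumption is to inspect rdd.partitioner directly (a sketch, with made-up data): mapValues keeps the partitioner, while a plain map drops it.
from pyspark import SparkContext

sc = SparkContext(appName='preserve-sketch')
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')]).partitionBy(2)

# mapValues cannot change the keys, so the partitioner is preserved.
print(rdd.mapValues(lambda v: v.upper()).partitioner)   # a Partitioner object

# A generic map could change the keys, so the partitioner is dropped.
print(rdd.map(lambda kv: (kv[0], kv[1].upper())).partitioner)  # None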
On 2015-02-13 12:43, Karlson wrote:
Does that mean partitioning does not work in Python? Or does this only
affect joining?
On 2015-02-12 19:27, Davies Liu wrote:
The feature works as expected in Scala/Java, but is not implemented in
Python.
Hi,
I believe that partitionBy will use the same (default) partitioner on
both RDDs.
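Spelled out as a sketch: both RDDs go through partitionBy with the same number of partitions and therefore end up with the same default hash partitioner. Note that, per Davies' reply above, the resulting shuffle-free join works in Scala/Java but was not implemented in PySpark at the time.
from pyspark import SparkContext

sc = SparkContext(appName='copartition-sketch')

# Same number of partitions, same (default hash) partitioner on both sides.
a = sc.parallelize([(1, 'a'), (2, 'b')]).partitionBy(4)
b = sc.parallelize([(1, 'x'), (2, 'y')]).partitionBy(4)

# In Scala/Java this join can reuse the existing partitioning instead of
# shuffling again; PySpark did not have that optimization yet.
print(a.join(b).collect())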
On 2015-02-12 17:12, Sean Owen wrote:
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com
wrote:
Hi Karlson,
I think your [...]
What's the explanation for that behaviour? Where am I wrong with my
assumption?
Thanks in advance,
Karlson