"java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread rok
When running the connectedComponents algorithm in GraphFrames on a sufficiently large dataset, I get the following error, which I have not encountered before: 17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID 53644, 172.19.1.206, executor 40): java.lang.IllegalStateException:
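
For context, a minimal sketch of the kind of call that triggers this (the graph, checkpoint path, and session setup are placeholders, not from the original post; connectedComponents requires a checkpoint directory to be set):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # assumes the graphframes package is available

spark = SparkSession.builder.appName("cc-sketch").getOrCreate()
# connectedComponents checkpoints intermediate results, so a checkpoint
# directory must be configured before calling it.
spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoints")

# Placeholder graph: vertices need an "id" column, edges need "src"/"dst".
vertices = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
edges = spark.createDataFrame([(1, 2), (2, 3)], ["src", "dst"])

g = GraphFrame(vertices, edges)
components = g.connectedComponents()
components.show()
```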

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-08 Thread rok
how much of the execution time is used up by cPickle.load() and .dump() methods. Hope that helps, Rok On Wed, Mar 8, 2017 at 3:18 AM, Yeoul Na [via Apache Spark User List] < ml-node+s1001560n28468...@n3.nabble.com> wrote: > > Hi all, > > I am trying to analyze PySpark pe
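
One way to get at those numbers is the built-in Python worker profiler (a sketch; the profiled job below is illustrative and assumes nothing about the original workload):

```python
from pyspark import SparkConf, SparkContext

# Enable the Python worker profiler; its cProfile output shows how much time
# the workers spend in (de)serialization calls versus user code.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10**6), 16).map(lambda x: (x % 100, x)).groupByKey()
rdd.count()

sc.show_profiles()  # prints cumulative cProfile stats for each profiled RDD
```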

Re: Is stddev not a supported aggregation function in SparkSQL WindowSpec?

2016-02-18 Thread rok
There is a stddev function since 1.6: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev If you are using spark < 1.6 you can write your own more or less easily. On Wed, Feb 17, 2016 at 5:06 PM, mayx [via Apache Spark User List] <
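
For example (a sketch written against a recent PySpark; the DataFrame and column names are invented):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 5.0), ("b", 9.0)], ["group", "value"])

# stddev used as a window aggregate over each group
w = Window.partitionBy("group")
df.withColumn("value_sd", F.stddev("value").over(w)).show()
```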

spark metrics in graphite missing for some executors

2015-12-11 Thread rok
it be fixed? Right now it's rendering many metrics useless since I want to have a complete view into the application and I'm only seeing a few executors at a time. Thanks, rok -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-metrics-in-graphite-missing

Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean? I didn't do anything specifically to partition the columns On Nov 14, 2015 00:38, "Davies Liu" <dav...@databricks.com> wrote: > Do you have partitioned columns? > > On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar <rokros...@gmail.com> w

Re: very slow parquet file write

2015-11-06 Thread Rok Roskar
schema consisting of a StructType with a few StructField floats and a string. I’m using all the spark defaults for io compression. I'll see what I can do about running a profiler -- can you point me to a resource/example? Thanks, Rok ps: my post on the mailing list is still listed as not accepted

very slow parquet file write

2015-11-05 Thread Rok Roskar
I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of the file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, no obvious effect). I was expecting the dump to
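
A stripped-down version of the write being described looks like this (a sketch against a recent PySpark API; the DataFrame, partition count, and path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write").getOrCreate()

# Placeholder DataFrame standing in for the ~100 Gb one in the post.
df = spark.range(10**6).withColumnRenamed("id", "value")

(df.repartition(200)            # "a few hundred partitions"
   .write
   .mode("overwrite")
   .parquet("hdfs:///tmp/slow-write-test.parquet"))
```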

very slow parquet file write

2015-11-05 Thread rok
Apologies if this appears a second time! I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of the file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, no

Re: help plz! how to use zipWithIndex to each subset of a RDD

2015-07-30 Thread rok
zipWithIndex gives you global indices, which is not what you want. You'll want to use flatMap with a map function that iterates through each iterable and returns the (String, Int, String) tuple for each element. On Thu, Jul 30, 2015 at 4:13 AM, askformore [via Apache Spark User List]
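
Something along these lines, assuming an existing SparkContext sc and made-up data (a sketch, not the poster's actual schema):

```python
rdd = sc.parallelize([("a", "x"), ("a", "y"), ("b", "z")])

# Group by key, then enumerate within each group to get per-subset indices.
indexed = (rdd.groupByKey()
              .flatMap(lambda kv: [(kv[0], i, v)
                                   for i, v in enumerate(kv[1])]))
# e.g. [('a', 0, 'x'), ('a', 1, 'y'), ('b', 0, 'z')]
```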

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-24 Thread Rok Roskar
/24 08:10:59 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1437724871597_0001_01 15/07/24 08:11:00 INFO spark.SecurityManager: Changing view acls to: root,rok 15/07/24 08:11:00 INFO spark.SecurityManager: Changing modify acls to: root,rok 15/07/24 08:11:00 INFO

problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-22 Thread rok
I am trying to run Spark applications with the driver running locally and interacting with a firewalled remote cluster via a SOCKS proxy. I have to modify the hadoop configuration on the *local machine* to try to make this work, adding property
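
The post is truncated before the property itself; purely as an illustration, the standard Hadoop SOCKS settings can also be injected from the Spark side via spark.hadoop.* keys (the proxy host/port are placeholders):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # spark.hadoop.* entries are copied into the Hadoop Configuration.
        .set("spark.hadoop.hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory")
        .set("spark.hadoop.hadoop.socks.server", "localhost:1080"))
sc = SparkContext(conf=conf)
```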

Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Rok Roskar
for you to send the logs. Imran On Fri, May 15, 2015 at 10:07 AM, rok rokros...@gmail.com wrote: I am trying to sort a collection of key,value pairs (between several hundred million to a few billion) and have recently been getting lots of FetchFailedException errors that seem to originate

FetchFailedException and MetadataFetchFailedException

2015-05-15 Thread rok
I am trying to sort a collection of key,value pairs (between several hundred million to a few billion) and have recently been getting lots of FetchFailedException errors that seem to originate when one of the executors doesn't seem to find a temporary shuffle file on disk. E.g.:
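
The operation in question is essentially a large sortByKey; one common mitigation for shuffle fetch failures is to spread the sort over more partitions so individual shuffle blocks stay small (a sketch; the data and partition count are illustrative):

```python
# Placeholder (key, value) RDD standing in for the real, much larger one.
kv_rdd = sc.parallelize([(i % 10**4, i) for i in range(10**6)])

sorted_rdd = kv_rdd.sortByKey(numPartitions=2048)
sorted_rdd.count()
```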

Re: StandardScaler failing with OOM errors in PySpark

2015-04-28 Thread Rok Roskar
That's exactly what I'm saying -- I specify the memory options using spark options, but this is not reflected in how the JVM is created. No matter which memory settings I specify, the JVM for the driver is always made with 512Mb of memory. So I'm not sure if this is a feature or a bug? rok
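
The usual explanation is that the driver JVM is already running by the time a SparkConf built in Python is applied, so spark.driver.memory has to be supplied before launch (spark-submit --driver-memory or spark-defaults.conf). A sketch of the workaround when starting from a plain Python/IPython session, with an arbitrary 8g value:

```python
import os

# Must be set *before* pyspark launches the gateway JVM; the trailing
# "pyspark-shell" token is required by PySpark's launcher.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"

from pyspark import SparkContext
sc = SparkContext(appName="standard-scaler")
```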

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
the spark.yarn.am.memory and overhead parameters but it doesn't seem to have an effect. Thanks, Rok On Apr 23, 2015, at 7:47 AM, Xiangrui Meng men...@gmail.com wrote: What is the feature dimension? Did you set the driver memory? -Xiangrui On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
are being ignored and instead starting the driver with just 512Mb of heap? On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar rokros...@gmail.com wrote: the feature dimension is 800k. yes, I believe the driver memory is likely the problem since it doesn't crash until the very last part of the tree

Re: java.util.NoSuchElementException: key not found:

2015-03-02 Thread Rok Roskar
, Shixiong Zhu zsxw...@gmail.com wrote: RDD is not thread-safe. You should not use it in multiple threads. Best Regards, Shixiong Zhu 2015-02-27 23:14 GMT+08:00 rok rokros...@gmail.com: I'm seeing this java.util.NoSuchElementException: key not found: exception pop up sometimes when I run
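
One conservative way to follow that advice is to serialize all use of the shared RDD across Python threads (a sketch; the RDD and the job itself are placeholders):

```python
import threading

shared_rdd = sc.parallelize(range(10**5))
rdd_lock = threading.Lock()

def run_job():
    # Only one thread drives a job on the shared RDD at a time.
    with rdd_lock:
        return shared_rdd.map(lambda x: x * 2).count()

threads = [threading.Thread(target=run_job) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```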

java.util.NoSuchElementException: key not found:

2015-02-27 Thread rok
I'm seeing this java.util.NoSuchElementException: key not found: exception pop up sometimes when I run operations on an RDD from multiple threads in a python application. It ends up shutting down the SparkContext so I'm assuming this is a bug -- from what I understand, I should be able to run

cannot connect to Spark Application Master in YARN

2015-02-18 Thread rok
. Are the applications somehow binding to the wrong ports? Is this a spark setting I need to configure or something within YARN? Thanks! Rok -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cannot-connect-to-Spark-Application-Master-in-YARN-tp21699.html Sent from

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-12 Thread Rok Roskar
, 0.6606453820150496, 0.8610156719813942, 0.6971353266345091, 0.9896836700210551, 0.05789392881996358] Is there a size limit for objects serialized with Kryo? Or an option that controls it? The Java serializer works fine. On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar rokros...@gmail.com wrote
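
For reference, the Kryo buffer ceiling is configurable; in recent Spark the key is spark.kryoserializer.buffer.max, which cannot exceed 2g (a sketch, with an arbitrary 1g value):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Raise the per-object Kryo buffer ceiling (hard upper bound is 2g).
        .set("spark.kryoserializer.buffer.max", "1g"))
sc = SparkContext(conf=conf)
```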

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Aha great! Thanks for the clarification! On Feb 11, 2015 8:11 PM, Davies Liu dav...@databricks.com wrote: On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote: I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
the runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating? On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote: On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
map is slower than the one before. I'll try the checkpoint, thanks for the suggestion. On Feb 12, 2015, at 12:13 AM, Davies Liu dav...@databricks.com wrote: On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar rokros...@gmail.com wrote: the runtime for each consecutive iteration is still roughly twice
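
A sketch of the checkpoint-based loop being suggested (the checkpoint path, iteration count, and update step are placeholders):

```python
sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")

rdd = sc.parallelize(range(10**5))
for i in range(10):
    rdd = rdd.map(lambda x: x + 1)   # placeholder update step
    if i % 3 == 0:
        rdd.checkpoint()             # truncate the growing lineage
        rdd.count()                  # an action forces the checkpoint to materialize
```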

pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread rok
next. Is there a limit to the size of broadcast variables? This one is rather large (a few Gb dict). Thanks! Rok -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html Sent from

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
10, 2015 at 10:42 AM, rok rokros...@gmail.com wrote: I'm trying to use a broadcasted dictionary inside a map function and am consistently getting Java null pointer exceptions. This is inside an IPython session connected to a standalone spark cluster. I seem to recall being able to do this before

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-29 Thread Rok Roskar
if this was somehow automatically corrected! Thanks, Rok On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu dav...@databricks.com wrote: HadoopRDD will try to split the file as 64M partitions in size, so you got 1916+ partitions. (assume 100k per row, they are 80G in size). I think it has very small chance

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
+ partitions (I don't know why actually -- how does this number come about?). How can I check if any objects/batches are exceeding 2Gb? Thanks, Rok On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu dav...@databricks.com wrote: Maybe it's caused by integer overflow, is it possible that one object
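
One rough way to check, sketched here: pickle each partition's contents and look at the totals, which bound the size of any batch inside that partition; if batches are the problem, saveAsPickleFile also takes a batchSize argument that can be lowered when writing. (The rdd name is a stand-in for the RDD that was saved.)

```python
import pickle

def partition_pickled_size(it):
    # Total pickled size of the elements in one partition, in bytes.
    yield sum(len(pickle.dumps(x, protocol=2)) for x in it)

sizes = rdd.mapPartitions(partition_pickled_size).collect()
print(max(sizes), "bytes in the largest partition")
```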

NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread rok
) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) Not really sure where to start looking for the culprit -- any suggestions most welcome. Thanks! Rok -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeException-in-pyspark-when-loading-an-RDD

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Rok Roskar
at 3:46 AM, Rok Roskar rokros...@gmail.com wrote: thanks for the suggestion -- however, looks like this is even slower. With the small data set I'm using, my aggregate function takes ~ 9 seconds and the colStats.mean() takes ~ 1 minute. However, I can't get it to run with the Kryo

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Rok Roskar
On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2) : vec1[vec2

calculating the mean of SparseVector RDD

2015-01-07 Thread rok
I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2) : vec1[vec2.indices] += vec2.values return vec1 def
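
The colStats approach mentioned in the reply above looks roughly like this (a sketch with a toy RDD of SparseVectors):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

rdd = sc.parallelize([
    Vectors.sparse(4, {0: 1.0, 3: 2.0}),
    Vectors.sparse(4, {1: 3.0}),
])

summary = Statistics.colStats(rdd)
mean = summary.mean()   # dense numpy array of per-column means
```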

ensuring RDD indices remain immutable

2014-12-01 Thread rok
I have an RDD that serves as a feature look-up table downstream in my analysis. I create it using the zipWithIndex() and because I suppose that the elements of the RDD could end up in a different order if it is regenerated at any point, I cache it to try and ensure that the (feature -- index)

Re: ensuring RDD indices remain immutable

2014-12-01 Thread rok
true though I was hoping to avoid having to sort... maybe there's no way around it. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ensuring-RDD-indices-remain-immutable-tp20094p20104.html Sent from the Apache Spark User List mailing list archive at
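
A sketch of the sort-then-index approach under discussion, with the result written out so the mapping is frozen (data and path are placeholders):

```python
features = sc.parallelize(["f3", "f1", "f2"])

# Sorting first makes the zipWithIndex assignment deterministic across
# regenerations; saving the result freezes the (feature -> index) mapping.
feature_index = features.sortBy(lambda f: f).zipWithIndex()
feature_index.saveAsPickleFile("hdfs:///tmp/feature-index")
```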

minimizing disk I/O

2014-11-13 Thread rok
I'm trying to understand the disk I/O patterns for Spark -- specifically, I'd like to reduce the number of files that are being written during shuffle operations. A couple questions: * is the amount of file I/O performed independent of the memory I allocate for the shuffles? * if this is the
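
For the Spark 1.x versions current at the time, the main knob for the number of shuffle files was spark.shuffle.consolidateFiles (an assumption about the version in use; the option applied to the hash-based shuffle and was later removed):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Merge map-side shuffle outputs into fewer, larger files
        # (hash-based shuffle in Spark 1.x).
        .set("spark.shuffle.consolidateFiles", "true"))
sc = SparkContext(conf=conf)
```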

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-13 Thread rok
Hi, I'm using Spark 1.1.0. There is no error on the executors -- it appears as if the job never gets properly dispatched -- the only message is the Broken Pipe message in the driver. -- View this message in context:

using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread rok
) error: [Errno 32] Broken pipe I'm at a loss as to where to begin to debug this... any suggestions? Thanks, Rok -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182.html Sent from
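
The call in question is along these lines (a sketch against the mllib API of that era, with a toy training set; the real data is obviously much larger):

```python
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

model = LogisticRegressionWithSGD.train(data, iterations=100)
print(model.predict([1.0, 0.0]))
```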

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread rok
yes, the training set is fine, I've verified it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18195.html Sent from the Apache Spark User List mailing list archive at

sharing RDDs between PySpark and Scala

2014-10-30 Thread rok
I'm processing some data using PySpark and I'd like to save the RDDs to disk (they are (k,v) RDDs of strings and SparseVector types) and read them in using Scala to run them through some other analysis. Is this possible? Thanks, Rok -- View this message in context: http://apache-spark-user
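
One language-neutral option (a sketch, not necessarily what was done): write the pairs out as plain-text JSON from PySpark and parse that on the Scala side, rather than relying on Python-pickled RDD files. Here pairs stands for the (string key, SparseVector) RDD and the path is a placeholder:

```python
import json

def to_line(kv):
    key, vec = kv
    # Flatten the SparseVector into plain JSON-friendly fields.
    return json.dumps({"key": key,
                       "size": vec.size,
                       "indices": vec.indices.tolist(),
                       "values": vec.values.tolist()})

pairs.map(to_line).saveAsTextFile("hdfs:///tmp/features-json")
```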

repartitioning an RDD yielding imbalance

2014-08-28 Thread Rok Roskar
this strange redistribution of elements? I'm obviously misunderstanding how spark does the partitioning -- is it a problem with having a list of strings as an RDD? Help very much appreciated! Thanks, Rok - To unsubscribe, e
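
A quick way to see the imbalance directly (a sketch, assuming an existing RDD called rdd):

```python
# Number of elements per partition, before and after repartitioning.
print(rdd.glom().map(len).collect())
print(rdd.repartition(100).glom().map(len).collect())
```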

out of memory errors -- per core memory limits?

2014-08-21 Thread Rok Roskar
is setting the per-core memory limit somewhere? Thanks, Rok - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Rok Roskar
(path, start, length): f = open(path) f.seek(start) data = f.read(length) #processing the data rdd = sc.parallelize(partitions, len(partitions)).flatMap(read_chunk) right... this is totally obvious in retrospect! Thanks! Rok * related to the first question -- when
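
Spelled out, the pattern referred to above looks roughly like this (a sketch; the chunk list, file path, and record parsing are placeholders):

```python
import numpy as np

def read_chunk(chunk):
    path, start, length = chunk
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(length)
    # Placeholder parsing: interpret the bytes as float64 records.
    return np.frombuffer(data, dtype=np.float64).tolist()

# Each element describes one chunk (path, offset, length) of a binary file.
partitions = [("/data/file.bin", i * 8 * 1024, 8 * 1024) for i in range(100)]
rdd = sc.parallelize(partitions, len(partitions)).flatMap(read_chunk)
```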

PySpark, numpy arrays and binary data

2014-08-06 Thread Rok Roskar
tips anyone might have for running pyspark with numpy data...! Thanks! Rok