Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Paul Wais
reproducer and even couldn't reproduce even though they spent their time. Memory leak issue is not really easy to reproduce, unless it leaks some objects without any conditions. - Jungtaek Lim (HeartSaVioR) On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote: Dear

pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-20 Thread Paul Wais
Dear List, I've observed some sort of memory leak when using pyspark to run ~100 jobs in local mode. Each job is essentially a create RDD -> create DF -> write DF sort of flow. The RDD and DFs go out of scope after each job completes, hence I call this issue a "memory leak." Here's pseudocode:
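The per-job lifecycle described above can be modeled in pure Python; the `Job` class below is a hypothetical stand-in for the RDD/DataFrame built by each job (the real objects are JVM-backed, so this only models the Python-side references, not the executor heap where such a leak would actually bite):

```python
import gc
import weakref

class Job:
    """Hypothetical stand-in for the RDD/DataFrame created by one job."""
    pass

refs = []
for _ in range(100):
    job = Job()                  # stand-in for: create RDD -> create DF -> write DF
    refs.append(weakref.ref(job))
    # rebinding `job` next iteration drops this iteration's reference

del job                          # drop the final iteration's reference too
gc.collect()
leaked = [r for r in refs if r() is not None]
# if nothing retains the per-job objects, `leaked` is empty
```

If the Python-side references are clean (as here) but memory still grows across jobs, the retention is below the Python API, on the JVM side.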

Avro support broken?

2019-07-04 Thread Paul Wais
Dear List, Has anybody gotten avro support to work in pyspark? I see multiple reports of it being broken on Stackoverflow and added my own repro to this ticket:

Perf impact of BlockManager byte[] copies

2015-02-27 Thread Paul Wais
Dear List, I'm investigating some problems related to native code integration with Spark, and while picking through BlockManager I noticed that data (de)serialization currently issues lots of array copies. Specifically: - Deserialization: BlockManager marshals all deserialized bytes through a
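The copies in question are JVM `byte[]` copies, but the cost model is easy to illustrate in Python, where `memoryview` slicing shares the underlying buffer while `bytes` slicing copies it (a general illustration of copy-vs-view semantics, not Spark's actual code path):

```python
buf = bytearray(b"abcdef")

view = memoryview(buf)[2:4]   # no copy: shares storage with buf
copy = bytes(buf[2:4])        # copies the two bytes out

# mutate the underlying buffer in place
buf[2] = ord("X")
buf[3] = ord("Y")

# the view observes the mutation; the copy does not
assert view.tobytes() == b"XY"
assert copy == b"cd"
```

For large blocks, each extra copy costs both allocation and memory bandwidth, which is why marshaling every deserialized block through an intermediate array is a plausible performance concern.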

Re: Support for SQL on unions of tables (merge tables?)

2015-01-21 Thread Paul Wais
Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to compute the sum of some field over a date range in SQL. Using log schema, I can read

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Paul Wais
To force one instance per executor, you could explicitly subclass FlatMapFunction and have it lazy-create your parser in the subclass constructor. You might also want to try RDD#mapPartitions() (instead of RDD#flatMap()) if you want one instance per partition. This approach worked well for me
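The one-instance-per-partition pattern can be sketched with a plain generator; in pyspark the same function would be passed to `rdd.mapPartitions(...)` (the `Parser` class here is a hypothetical stand-in for an expensive-to-construct object):

```python
class Parser:
    """Hypothetical expensive-to-construct parser."""
    instances = 0

    def __init__(self):
        Parser.instances += 1   # count constructions, for illustration

    def parse(self, line):
        return line.upper()

def parse_partition(lines):
    parser = Parser()           # built once per partition, not once per record
    for line in lines:
        yield parser.parse(line)

out = list(parse_partition(iter(["a", "b", "c"])))
# only one Parser served the whole partition
```

A naive `flatMap`-style function would construct the parser per record; pushing construction to partition granularity amortizes that cost over all of the partition's records.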

Support for SQL on unions of tables (merge tables?)

2015-01-11 Thread Paul Wais
Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to compute the sum of some field over a date range in SQL. Using log schema, I can read each as a distinct SchemaRDD, but I
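The per-day-union query can be modeled in plain Python; in Spark 1.x each day would be a distinct SchemaRDD/DataFrame and the union would be built with `unionAll` before registering a temp table (the field names below are hypothetical):

```python
from datetime import date

# one "table" per daily log file, sharing the log schema
day1 = [{"day": date(2015, 1, 1), "bytes": 10}]
day2 = [{"day": date(2015, 1, 2), "bytes": 20}]
day3 = [{"day": date(2015, 1, 3), "bytes": 30}]

# union the per-day tables, then aggregate over a date range,
# i.e. SELECT SUM(bytes) ... WHERE day BETWEEN start AND end
union = day1 + day2 + day3
start, end = date(2015, 1, 1), date(2015, 1, 2)
total = sum(r["bytes"] for r in union if start <= r["day"] <= end)
# total == 30 (day1 + day2 fall in range; day3 does not)
```

The practical question in the thread is whether the query engine can prune whole per-day tables from the union using the date predicate, rather than scanning all of them.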

Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data.

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List, Has anybody had experience integrating C/C++ code into Spark jobs? I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs: * Shipping the native
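The thread uses JNA on the JVM; the closest Python-side analogue is `ctypes`. A minimal sketch, calling libc's `abs` from a partition-style generator so the library handle is set up once per partition rather than once per record (library resolution is platform-dependent; this assumes a libc is findable):

```python
import ctypes
import ctypes.util

# load the C runtime once; on Linux this resolves to libc.so.6
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]

def native_abs_partition(values):
    # one native-library handle serves the whole partition
    for v in values:
        yield libc.abs(v)

out = list(native_abs_partition([-1, 2, -3]))
# out == [1, 2, 3]
```

The tradeoffs named in the thread still apply: the native library must be shipped to and loadable on every executor, and crashes in native code take down the whole worker process rather than raising a catchable exception.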

Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
taking memory. On Oct 30, 2014 6:43 PM, Paul Wais pw...@yelp.com wrote: Dear Spark List, I have a Spark app that runs native code inside map functions. I've noticed that the native code sometimes sets errno to ENOMEM indicating a lack of available memory. However, I've

Re: Any issues with repartition?

2014-10-08 Thread Paul Wais
Looks like an OOM issue? Have you tried persisting your RDDs to allow disk writes? I've seen a lot of similar crashes in a Spark app that reads from HDFS and does joins. I.e. I've seen java.io.IOException: Filesystem closed, Executor lost, FetchFailed, etc etc with non-deterministic crashes.

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-19 Thread Paul Wais
Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just making Kryo use the JavaSerializer for my messages. The resulting stack trace made it look like protobuf GeneratedMessageLite is actually using the

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-19 Thread Paul Wais
Derp, one caveat to my solution: I guess Spark doesn't use Kryo for Function serde :( On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais pw...@yelp.com wrote: Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just

Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
Dear List, I'm writing an application where I have RDDs of protobuf messages. When I run the app via bin/spark-submit with --master local --driver-class-path path/to/my/uber.jar, Spark is able to ser/deserialize the messages correctly. However, if I run WITHOUT --driver-class-path

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
/hadoop-project/pom.xml On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais pw...@yelp.com wrote: Dear List, I'm writing an application where I have RDDs of protobuf messages. When I run the app via bin/spark-submit with --master local --driver-class-path path/to/my/uber.jar, Spark is able to ser

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
* https://github.com/apache/spark/pull/181 * http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E * https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I On Thu, Sep 18, 2014 at 4:51 PM, Paul Wais pw

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
hmm would using Kryo help me here? On Thursday, September 18, 2014, Paul Wais pw...@yelp.com wrote: Ah, can one NOT create an RDD of any arbitrary Serializable type? It looks like I might be getting bitten by the same "java.io.ObjectInputStream uses root class loader only" bugs mentioned

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74 If uber.jar is on the classpath, then the root classloader would have the code, hence why --driver-class-path fixes the bug. On Thu, Sep 18, 2014 at 5:42 PM, Paul Wais pw...@yelp.com wrote

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
--master yarn-client ... will fail. But on the personal build obtained from the command above, both will then work. -Christian On Sep 15, 2014, at 6:28 PM, Paul Wais pw...@yelp.com wrote: Dear List, I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for reading SequenceFiles

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
, when it shouldn't be packaged. Spark works out of the box with just about any modern combo of HDFS and YARN. On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais pw...@yelp.com wrote: Dear List, I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for reading SequenceFiles. In particular

Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Paul Wais
Dear List, I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for reading SequenceFiles. In particular, I'm seeing: Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at

Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
in the second half of next month (or shortly thereafter). On Wed, Jul 16, 2014 at 4:03 PM, Paul Wais pw...@yelp.com wrote: Dear List, The version of pyspark on master has a lot of nice new features, e.g. SequenceFile reading, pickle i/o, etc: https://github.com/apache/spark/blob/master

Release date for new pyspark

2014-07-16 Thread Paul Wais
, -Paul Wais