Re: Spark handling of a file://xxxx.gz URI

2014-12-16 Thread Harry Brundage
Are you certain that's happening, Jim? Why? What happens if you just do sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat for gzip and the RDD wrapper around it already have the streaming behaviour you wish for, but I could be wrong. Also, are you in PySpark or Scala Spark?
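
A minimal sketch of the check suggested above, assuming a local gzipped text file (the path and app name are placeholders). Gzip is not splittable, so Hadoop's TextInputFormat should decompress the file as a stream within a single partition rather than loading it whole:

    import org.apache.spark.{SparkConf, SparkContext}

    object GzipCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GzipCount"))
        // file:// URI pointing at a gzipped text file (placeholder path).
        val fileUri = "file:///tmp/example.gz"
        // Gzip is not splittable, so this RDD ends up with a single
        // partition that streams through the decompressed lines.
        val lines = sc.textFile(fileUri)
        println(lines.count())
        sc.stop()
      }
    }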

Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-11 Thread Harry Brundage
We've had success with Azkaban from LinkedIn over Oozie and Luigi. http://azkaban.github.io/ Azkaban has support for many different job types, a fleshed-out web UI with decent log reporting, a decent failure/retry model, a REST API, and I think support for multiple executor slaves is coming in
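
For reference, a minimal sketch of what an Azkaban job definition for a Spark application can look like; jobs are plain properties files, and the file names, class, and jar below are hypothetical:

    # spark-job.job
    type=command
    command=spark-submit --class com.example.MyJob --master yarn-cluster myapp.jar

    # report.job (runs only after spark-job succeeds)
    type=command
    command=./generate_report.sh
    dependencies=spark-job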

Repartition to data-size per partition

2014-11-09 Thread Harry Brundage
I want to avoid the small-files problem when using Spark without having to manually calibrate a `repartition` at the end of each Spark application I write, since the amount of data passing through sadly isn't all that predictable. We're reading data from and writing data to HDFS. I know other
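
One workaround (a sketch, not what the list settled on) is to size the repartition from the input data on HDFS rather than calibrating it by hand. The 128 MB target and the paths below are assumptions, and it only helps to the extent that output volume tracks input volume:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Target bytes per output partition; 128 MB is an arbitrary choice.
    val targetBytesPerPartition = 128L * 1024 * 1024
    val inputPath = new Path("hdfs:///data/input") // hypothetical path

    // Ask HDFS how big the input is, and derive a partition count from it.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(inputPath).getLength
    val numPartitions = math.max(1, (inputBytes / targetBytesPerPartition).toInt)

    sc.textFile(inputPath.toString)
      // ... transformations ...
      .repartition(numPartitions)
      .saveAsTextFile("hdfs:///data/output") // hypothetical path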

Is deferred execution of multiple RDDs ever coming?

2014-07-21 Thread Harry Brundage
Hello fellow Sparkians. In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ, Matei suggested that Spark might get deferred grouping and forced execution of multiple jobs in an efficient way. His code sample: rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
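
reduceLater itself was never merged, but Spark's async actions get part of the way there: each returns a FutureAction immediately, so several jobs can run concurrently over a cached RDD. They don't share a single scan of the data, though, which is the efficiency Matei was describing. A sketch with a hypothetical path:

    import org.apache.spark.SparkContext._ // async action implicits (Spark 1.x)
    import scala.concurrent.Await
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    val rdd = sc.textFile("hdfs:///data/input").cache()
    val countF = rdd.countAsync()   // submits a job, returns FutureAction[Long]
    val sampleF = rdd.takeAsync(10) // submits a second job without blocking

    // FutureAction extends scala.concurrent.Future, so the two compose.
    val both = for { c <- countF; s <- sampleF } yield (c, s)
    println(Await.result(both, Duration.Inf))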

Re: trouble with join on large RDDs

2014-04-14 Thread Harry Brundage
Brad: did you ever manage to figure this out? We're experiencing similar problems, and have also found that the memory limitations supplied to the Java side of PySpark don't limit how much memory Python can consume (which makes sense). Have you profiled the datasets you are trying to join? Is
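
One cheap way to profile the datasets before the join is to sample the keys and look for skew, since a single hot key can blow up one worker's memory regardless of the limits on the JVM side. A sketch (the path and tab-delimited layout are assumptions):

    // Count sampled keys to spot skew before joining.
    val left = sc.textFile("hdfs:///data/left") // hypothetical path
      .map(line => (line.split("\t")(0), line)) // assumes tab-delimited, key first

    val hotKeys = left
      .sample(withReplacement = false, fraction = 0.01)
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)
      .sortBy(-_._2) // heaviest keys first
      .take(20)

    hotKeys.foreach(println)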