Re: Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Harry Brundage
Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the streaming
behaviour you wish for, but I could be wrong. Also, are you in PySpark or
Scala Spark?

On Tue, Dec 16, 2014 at 1:22 PM, Jim Carroll jimfcarr...@gmail.com wrote:

 Is there a way to get Spark to NOT repartition/shuffle/expand a
 sc.textFile(fileUri) when the URI is a gzipped file?

 Expanding a gzipped file should be thought of as a transformation and not
 an action (if the analogy is apt). There is no need to fully create and
 fill out an intermediate RDD with the expanded data when it can be done one
 row at a time.







Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-11 Thread Harry Brundage
We've had success with Azkaban from LinkedIn over Oozie and Luigi.
http://azkaban.github.io/

Azkaban has support for many different job types, a fleshed-out web UI with
decent log reporting, a decent failure / retry model, a REST API, and I
think support for multiple executor slaves is coming in the future. We
found Oozie's configuration and execution cumbersome, and Luigi immature.
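
For a Spark job, a minimal Azkaban flow is just a couple of .job property
files, something like this (job names, class, and paths are made up):

# compute_report.job -- hypothetical: launch the Spark app via spark-submit
type=command
command=spark-submit --master yarn-cluster --class com.example.ComputeReport /jobs/report.jar

# email_report.job -- hypothetical: runs only after compute_report succeeds
type=command
command=/jobs/email_report.sh
dependencies=compute_report

As far as I remember, retries and failure notifications are configured the
same way, as plain properties on the job, which is a big part of why we found
it less cumbersome than Oozie's XML workflows.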

On Tue, Nov 11, 2014 at 2:50 AM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi again,

 As Jimmy said, any thoughts about Luigi and/or any other tools? So far it
 seems that Oozie is the best and only choice here. Is that right?

 On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com
 wrote:

 I have used Oozie for all our workflows with Spark apps, but you will have
 to use a Java action as the workflow element. I am interested in anyone's
 experience with Luigi and/or any other tools.


 On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 I have some previous experience with Apache Oozie from when I was developing
 in Apache Pig. Now I am working exclusively with Apache Spark, and I am
 looking for a tool with similar functionality. Is Oozie recommended? What
 about Luigi? What do you use / recommend?




 --


 Nothing under the sun is greater than education. By educating one person
 and sending him/her into the society of his/her generation, we make a
 contribution extending a hundred generations to come.
 -Jigoro Kano, Founder of Judo-





Repartition to data-size per partition

2014-11-09 Thread Harry Brundage
I want to avoid the small files problem when using Spark, without having to
manually calibrate a `repartition` at the end of each Spark application I
am writing, since the amount of data passing through sadly isn't all that
predictable. We're reading data from and writing it back to HDFS.

I know other tools like Pig can set the number of reducers and thus the
number of output partitions for you based on the size of the input data,
but I want to know if anyone else has a better way to do this with Spark's
primitives.

Right now we have an OK solution, but it is starting to break down. We cache
our output RDD at the end of the application's flow, then map over it once
more to guess what size it will be when pickled and gzipped (we're in
PySpark), and then compute a number of partitions to repartition to using a
target partition size. The problem is that we want to work with datasets
bigger than what will comfortably fit in the cache. Just spitballing here,
but what would be amazing is the ability to ask Spark how big it thinks each
partition might be, the ability to give an accumulator as an argument to
`repartition` whose value wouldn't be used until the prior stage had
finished, or the ability to just have Spark repartition to a target
partition size for us.
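
To make the current approach concrete, it looks roughly like this in PySpark
(the target size, helper name, and output path are made up):

import pickle
import zlib

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # hypothetical target, roughly one HDFS block

def guessed_row_bytes(row):
    # rough guess at one row's pickled + gzipped footprint; compressing
    # row-by-row overestimates a little, but it's close enough for sizing
    return len(zlib.compress(pickle.dumps(row)))

output = build_output_rdd(sc)  # stand-in for the application's real flow
output.cache()  # this is the part that breaks down for large datasets

total_bytes = output.map(guessed_row_bytes).sum()
num_partitions = int(total_bytes / TARGET_PARTITION_BYTES) + 1  # cheap ceiling

output.repartition(num_partitions).saveAsTextFile("hdfs:///output/path")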

Thanks for any help you can give me!


Is deferred execution of multiple RDDs ever coming?

2014-07-21 Thread Harry Brundage
Hello fellow Sparkians.

In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ,
Matei suggested that Spark might get deferred grouping and forced execution
of multiple jobs in an efficient way. His code sample:

rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
rdd.reduceLater(reduceFunction2) // returns Future[ResultType2]
SparkContext.force() // executes all the later operations as part of a
single optimized job

This would be immensely useful. If you ever want to do two passes over the
data and save two different results to disk, you either have to cache the
RDD, which can be slow or deprive the processing code of memory, or
recompute the whole thing twice. If Spark were smart enough to let you group
these operations together and fork an RDD (say, an RDD.partition method),
you could very easily implement these n-pass operations across RDDs and have
Spark execute them efficiently.

Our use case for a feature like this is processing many records, attaching
metadata about our confidence in the data points to each record during
processing, and then writing the data to one spot and the metadata to
another spot.
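
Today the closest we can get is caching and making two passes over the cached
RDD, roughly like this (helper names and paths are made up):

enriched = process(raw_records)  # hypothetical: each element is (record, metadata)
enriched.cache()  # pinned only so the second save isn't a full recompute

enriched.map(lambda pair: pair[0]).saveAsTextFile("hdfs:///out/records")
enriched.map(lambda pair: pair[1]).saveAsTextFile("hdfs:///out/metadata")

enriched.unpersist()

A deferred, grouped version of those two saves is exactly what we're after.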

I've also wanted this for taking a dataset, profiling it for partition size
or anomalously sized partitions, and then using the profiling result to
repartition the data before saving it to disk, which I think is impossible
to do without caching right now. This use case is a bit more interesting
because information from earlier in the DAG needs to influence later stages,
so I suspect the answer will be "cache the thing". I explicitly don't want
to cache it because I'm not really doing an iterative algorithm where I'm
willing to pay the heap and time penalties; I'm just doing an operation that
needs run-time information without a collect call. This suggests that
something like a repartition with a lazily evaluated accumulator might work
as well, but I haven't been able to figure out a solution even with this
primitive and the current APIs.

So, does anyone know if this feature might land, and if not, where to start
implementing it? What would the Python story for Futures be?


Re: trouble with join on large RDDs

2014-04-14 Thread Harry Brundage
Brad: did you ever manage to figure this out? We're experiencing similar
problems, and have also found that the memory limitations supplied to the
Java side of PySpark don't limit how much memory Python can consume (which
makes sense). 

Have you profiled the datasets you are trying to join? Is there any skew
in the data, where a handful of join key values occur far more often than
the rest? Note that by default the join key in PySpark is hashed with the
Python `hash` function, which for non-builtin values can have unexpected
behaviour.
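
A quick way to check for skew before the join, sketched in PySpark (the RDD
name is made up):

# count rows per join key and look at the heaviest keys; a few keys carrying
# most of the rows will pile work (and memory) onto a handful of tasks
key_counts = left_rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
for key, count in key_counts.takeOrdered(20, key=lambda kc: -kc[1]):
    print("%s\t%d" % (key, count))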


