Re: Spark handling of a file://xxxx.gz Uri
Are you certain that's happening, Jim? Why? What happens if you just do sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat for gzip and the RDD wrapper around it already have the streaming behaviour you wish for, but I could be wrong. Also, are you in PySpark or Scala Spark?

On Tue, Dec 16, 2014 at 1:22 PM, Jim Carroll jimfcarr...@gmail.com wrote:
Is there a way to get Spark to NOT repartition/shuffle/expand a sc.textFile(fileUri) when the URI is a gzipped file? Expanding a gzipped file should be thought of as a transformation and not an action (if the analogy is apt). There is no need to fully create and fill out an intermediate RDD with the expanded data when it can be done one row at a time.
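To make the suggestion concrete, here is a minimal PySpark sketch (the file path and app name are hypothetical): reading a .gz with textFile should give a single, non-splittable partition, and count() streams through it without any shuffle unless you explicitly repartition.

from pyspark import SparkContext

sc = SparkContext(appName="GzipCount")

file_uri = "file:///tmp/example.txt.gz"  # hypothetical path; any file:// or hdfs:// URI to a .gz behaves the same

lines = sc.textFile(file_uri)
print(lines.getNumPartitions())  # expect 1: gzip is not splittable, so no expansion into many partitions
print(lines.count())             # streams through the file one record at a time; no shuffle involved

# A shuffle only happens if you ask for one explicitly, e.g. lines.repartition(16)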
Re: which is the recommended workflow engine for Apache Spark jobs?
We've had success with Azkaban from LinkedIn over Oozie and Luigi: http://azkaban.github.io/ Azkaban has support for many different job types, a fleshed-out web UI with decent log reporting, a decent failure/retry model, and a REST API, and I think support for multiple executor slaves is coming in the future. We found Oozie's configuration and execution cumbersome, and Luigi immature.

On Tue, Nov 11, 2014 at 2:50 AM, Adamantios Corais adamantios.cor...@gmail.com wrote:
Hi again, As Jimmy said, any thoughts about Luigi and/or any other tools? So far it seems that Oozie is the best and only choice here. Is that right?

On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com wrote:
I have used Oozie for all our workflows with Spark apps, but you will have to use a Java action as the workflow element. I am interested in anyone's experience with Luigi and/or any other tools.

On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais adamantios.cor...@gmail.com wrote:
I have some previous experience with Apache Oozie from when I was developing in Apache Pig. Now I am working exclusively with Apache Spark and am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use / recommend?

--
Nothing under the sun is greater than education. By educating one person and sending him/her into the society of his/her generation, we make a contribution extending a hundred generations to come. -Jigoro Kano, Founder of Judo
Repartition to data-size per partition
I want to avoid the small-files problem when using Spark, without having to manually calibrate a `repartition` at the end of each Spark application I am writing, since the amount of data passing through sadly isn't all that predictable. We're reading from and writing to HDFS. I know other tools like Pig can set the number of reducers, and thus the number of output partitions, for you based on the size of the input data, but I want to know if anyone else has a better way to do this with Spark's primitives.

Right now we have an OK solution, but it is starting to break down. We cache our output RDD at the end of the application's flow, then map over it once more to estimate how big it will be once pickled and gzipped (we're in PySpark), and then compute a number of partitions to repartition to using a target partition size. The problem is that we want to work with datasets bigger than what will comfortably fit in the cache.

Just spitballing here, but what would be amazing is the ability to ask Spark how big it thinks each partition might be, or the ability to give an accumulator as an argument to `repartition` whose value wouldn't be used until the prior stage had finished, or the ability to just have Spark repartition to a target partition size for us. Thanks for any help you can give me!
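For concreteness, here is a minimal PySpark sketch of the current approach described above; the pickle + gzip estimate, the 128 MB target, and the HDFS paths are illustrative assumptions rather than our exact code (and it needs Python 3 for gzip.compress):

import gzip
import pickle
from pyspark import SparkContext

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # assumed target size per output partition

def partition_size_bytes(records):
    # Estimate the on-disk size of one partition by pickling and gzipping its records.
    data = gzip.compress(pickle.dumps(list(records)))
    yield len(data)

sc = SparkContext(appName="SizeBasedRepartition")
output = sc.textFile("hdfs:///path/to/input")  # hypothetical input; stands in for the real pipeline

output.cache()  # without caching, the size estimate would recompute the whole lineage twice
total_bytes = output.mapPartitions(partition_size_bytes).sum()

num_partitions = max(1, int(total_bytes / TARGET_PARTITION_BYTES))
output.repartition(num_partitions).saveAsTextFile("hdfs:///path/to/output")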
Is deferred execution of multiple RDDs ever coming?
Hello fellow Sparkians. In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ, Matei suggested that Spark might get deferred grouping and forced execution of multiple jobs in an efficient way. His code sample:

rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
rdd.reduceLater(reduceFunction2) // returns Future[ResultType2]
SparkContext.force() // executes all the later operations as part of a single optimized job

This would be immensely useful. If you ever want to do two passes over the data and save two different results to disk, you either have to cache the RDD, which can be slow or can deprive the processing code of memory, or recompute the whole thing twice. If Spark were smart enough to let you group these operations together and fork an RDD (say, an RDD.partition method), you could very easily implement these n-pass operations across RDDs and have Spark execute them efficiently.

Our use case for a feature like this is processing many records, attaching metadata to the records during processing about our confidence in the data points, and then writing the data to one spot and the metadata to another spot. I've also wanted this for taking a dataset, profiling it for partition size or anomalously sized partitions, and then using the profiling result to repartition the data before saving it to disk, which I think is impossible to do without caching right now. This use case is a bit more interesting because information from earlier in the DAG needs to influence later stages, so I suspect the answer will be "cache the thing". I explicitly don't want to cache it, because I'm not doing an iterative algorithm where I'm willing to pay the heap and time penalties; I'm just doing an operation which needs run-time information without a collect call. This suggests that something like a repartition with a lazily evaluated accumulator might work as well, but I haven't been able to figure out a solution even with this primitive and the current APIs.

So, does anyone know if this feature might land, and if not, where to start implementing it? What would the Python story for Futures be?
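For anyone reading along, here is roughly the workaround we use today in PySpark while something like the API above doesn't exist; the paths, record format, and confidence logic are made up for illustration, and it pays exactly the caching cost the question is trying to avoid:

from pyspark import SparkContext

sc = SparkContext(appName="TwoPassWrite")

def process(line):
    # Hypothetical processing: return (record, confidence-metadata) pairs.
    record = line.strip()
    confidence = 1.0 if record else 0.0
    return (record, confidence)

processed = sc.textFile("hdfs:///path/to/input").map(process)
processed.cache()  # without a reduceLater/force-style API, caching is the only way to avoid recomputing

# Pass 1: write the records themselves.
processed.map(lambda rc: rc[0]).saveAsTextFile("hdfs:///path/to/records")

# Pass 2: write the per-record metadata somewhere else.
processed.map(lambda rc: str(rc[1])).saveAsTextFile("hdfs:///path/to/metadata")

processed.unpersist()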
Re: trouble with join on large RDDs
Brad: did you ever manage to figure this out? We're experiencing similar problems, and have also found that the memory limitations supplied to the Java side of PySpark don't limit how much memory Python can consume (which makes sense).

Have you profiled the datasets you are trying to join? Is there any skew in the data, where a handful of join key values occur far more often than the rest of the values? Note that the join key in PySpark is hashed by default using the Python `hash` function, which can have unexpected behaviour for non-builtin values.
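In case it's useful, here is a rough PySpark sketch (the input path and key extraction are hypothetical) for checking both of those things: skew in the join keys, and what types the keys actually are.

from pyspark import SparkContext

sc = SparkContext(appName="JoinSkewCheck")

# Hypothetical: build a pair RDD keyed by whatever you join on.
left = sc.textFile("hdfs:///path/to/left").map(lambda line: (line.split("\t")[0], line))

# Count how many records share each join key; a few very large counts means skew,
# and those keys will pile up in a single partition during the join.
key_counts = left.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
print(key_counts.map(lambda kc: kc[1]).max())                          # largest single key
print(key_counts.sortBy(lambda kc: kc[1], ascending=False).take(10))   # top offenders

# Partitioning is driven by hashing the key, so check that the keys are plain
# built-in types (str, int, tuple); custom objects with inconsistent __hash__/__eq__
# can scatter "equal" keys across partitions.
print(left.keys().map(lambda k: type(k).__name__).distinct().collect())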