Are you certain that's happening, Jim? Why? What happens if you just do sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat for gzip and the RDD wrapper around it already have the "streaming" behaviour you wish for, but I could be wrong. Also, are you in PySpark or Scala Spark?
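For what it's worth, here's a small stdlib sketch (not Spark itself, and the helper name is made up) showing the streaming behaviour in question: gzip decompresses as a stream, so counting lines never requires materializing the expanded data. Hadoop's gzip codec works the same way underneath sc.textFile().

```python
import gzip
import os
import tempfile

def count_lines_streaming(path):
    """Count lines in a .gz file one row at a time.

    gzip.open returns a streaming file object, so only one
    decompressed line is held in memory at any moment -- no
    intermediate "expanded" copy of the file is ever built.
    """
    with gzip.open(path, "rt") as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    # Build a small sample .gz file to demonstrate against.
    tmp = tempfile.NamedTemporaryFile(suffix=".gz", delete=False)
    tmp.close()
    with gzip.open(tmp.name, "wt") as f:
        for i in range(1000):
            f.write(f"row {i}\n")
    print(count_lines_streaming(tmp.name))  # 1000
    os.unlink(tmp.name)
```

The same principle is why a plain count() over a gzipped textFile shouldn't need to fill out an intermediate RDD first (modulo the fact that a .gz file is not splittable, so it lands in a single partition).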
On Tue, Dec 16, 2014 at 1:22 PM, Jim Carroll <jimfcarr...@gmail.com> wrote:
>
> Is there a way to get Spark to NOT repartition/shuffle/expand a
> sc.textFile(fileUri) when the URI is a gzipped file?
>
> Expanding a gzipped file should be thought of as a "transformation" and not
> an "action" (if the analogy is apt). There is no need to fully create and
> fill out an intermediate RDD with the expanded data when it can be done one
> row at a time.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-handling-of-a-file-xxxx-gz-Uri-tp20726.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.