Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the "streaming"
behaviour you wish for, but I could be wrong. Also, are you in PySpark or
Scala Spark?
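
Something along these lines (a rough sketch in Scala; it assumes a
SparkContext sc and that fileUri points at a single .gz file) should make
the behaviour visible:

    // Sketch only: sc is a SparkContext, fileUri points at one .gz file.
    val rdd = sc.textFile(fileUri)

    // Gzip isn't splittable, so the whole file should arrive as a single
    // partition rather than being expanded or repartitioned up front.
    println(s"partitions: ${rdd.partitions.length}")   // expect 1

    // count() is an action: it decompresses and counts line by line,
    // never filling out an intermediate RDD with the expanded data.
    println(s"lines: ${rdd.count()}")

If that prints one partition and the count finishes in roughly the time it
takes to gunzip the file, the streaming behaviour is already there.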

On Tue, Dec 16, 2014 at 1:22 PM, Jim Carroll <jimfcarr...@gmail.com> wrote:
>
> Is there a way to get Spark to NOT repartition/shuffle/expand a
> sc.textFile(fileUri) when the URI is a gzipped file?
>
> Expanding a gzipped file should be thought of as a "transformation" and not
> an "action" (if the analogy is apt). There is no need to fully create and
> fill out an intermediate RDD with the expanded data when it can be done one
> row at a time.
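
To illustrate the laziness point: map() and filter() on that RDD are
transformations, applied one row at a time as the file is decompressed, and
nothing is materialized until an action forces it. A minimal sketch (the
filter condition and outputUri are made up, just to show the shape):

    // Sketch only: assumes sc and a gzipped text file at fileUri.
    // Transformations build a lineage, not an intermediate expanded RDD.
    val longLines = sc.textFile(fileUri)
      .map(_.trim)
      .filter(_.length > 80)

    // Only this action pulls rows through the pipeline, streaming each
    // decompressed line through map/filter without an intermediate
    // "expanded" RDD being filled out first.
    longLines.saveAsTextFile(outputUri)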
