Is there a way to get Spark to NOT repartition/shuffle/expand an
sc.textFile(fileUri) call when the URI points to a gzipped file?
Expanding a gzipped file should be thought of as a transformation and not
an action (if the analogy is apt). There is no need to fully create and
fill out an intermediate RDD with the decompressed contents.
Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the streaming
behaviour you wish for, but I could be wrong. Also, are you in PySpark or
Scala Spark?
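A minimal sketch of the experiment Harry is suggesting, assuming a standard Spark setup (the file path and app name here are hypothetical). One thing worth checking along the way: because gzip is not a splittable compression format, Hadoop's TextInputFormat reads the whole file as one split, so the resulting RDD should have a single partition and count() should stream through it rather than materialize everything first.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GzipCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; adjust master/app name for your cluster.
    val conf = new SparkConf().setAppName("GzipCountSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical path to a gzipped text file.
    val fileUri = "hdfs:///data/example.txt.gz"

    // textFile uses Hadoop's TextInputFormat, which decompresses gzip
    // on the fly; since gzip is not splittable, a single .gz file
    // should come back as one partition.
    val lines = sc.textFile(fileUri)
    println(s"partitions: ${lines.getNumPartitions}")

    // count() is an action, but it iterates the decompressed lines
    // as a stream; it should not need to fully expand the file into
    // an intermediate in-memory RDD.
    println(s"lines: ${lines.count()}")

    sc.stop()
  }
}
```

If the partition count is 1 and the job still fails, the problem is likely elsewhere (e.g. a later shuffle or disk-spill stage) rather than in the gzip read itself.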
Hi Harry,
Thanks for your response.
I'm working in Scala. When I call count (since it's an action), it expands
the RDD during the count. You can see the call stack that results in the
failure of the job here:
ERROR DiskBlockObjectWriter - Uncaught exception while reverting
partial writes