Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the streaming
behaviour you're after, but I could be wrong. Also, are you in PySpark or
Scala Spark?
Hi Harry,
Thanks for your response.
I'm working in Scala. When I call count, it materializes the RDD
(since count is an action). You can see the call stack from the
resulting job failure here:
ERROR DiskBlockObjectWriter - Uncaught exception while reverting
partial writes