Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim Carroll
Is there a way to get Spark to NOT repartition/shuffle/expand a sc.textFile(fileUri) when the URI is a gzipped file? Expanding a gzipped file should be thought of as a transformation and not an action (if the analogy is apt). There is no need to fully create and fill out an intermediate RDD with the uncompressed data before downstream transformations run.
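
For reference, a minimal Scala sketch of the pattern under discussion (the path is hypothetical; any file://...gz URI behaves the same way):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("gzip-read").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Gzip is not splittable, so this RDD is read as a single partition.
    val lines = sc.textFile("file:///data/input.gz")

    // Transformations are lazy; nothing is decompressed until an action runs.
    val nonEmpty = lines.filter(_.nonEmpty)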

Re: Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Harry Brundage
Are you certain that's happening, Jim? Why? What happens if you just do sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat for gzip and the RDD wrapper around it already have the streaming behaviour you wish for, but I could be wrong. Also, are you in PySpark or Scala Spark?
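
A sketch of the suggested check (fileUri stands for whatever gzipped URI is being loaded): count() is an action that streams through the input once via Hadoop's TextInputFormat and GzipCodec, decompressing records on the fly:

    // count() makes a single streaming pass over the gzipped input;
    // no intermediate RDD is materialized and no shuffle is performed.
    val n = sc.textFile(fileUri).count()
    println(s"line count: $n")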

Re: Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim
Hi Harry, thanks for your response. I'm working in Scala. When I do a count call, it expands the RDD inside the count (since count is an action). You can see the call stack that results in the failure of the job here: ERROR DiskBlockObjectWriter - Uncaught exception while reverting partial writes
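
One way to narrow down where that failure comes from (a sketch with a hypothetical path): toDebugString prints the RDD lineage, so any shuffle stage, whose partial block writes go through DiskBlockObjectWriter, is visible before the action runs:

    val rdd = sc.textFile("file:///data/input.gz")

    // A plain textFile lineage should show only a HadoopRDD plus a
    // MapPartitionsRDD; a ShuffledRDD here would explain disk writes
    // during count().
    println(rdd.toDebugString)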