Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the streaming
behaviour you're after, but I could be wrong. Also, are you in PySpark or
Scala Spark?
Hi Harry,
Thanks for your response.
I'm working in Scala. When I call count, it materializes the RDD
(since count is an action). You can see the call stack from the
resulting job failure here:
ERROR DiskBlockObjectWriter - Uncaught exception while reverting
partial writes