Hi Harry,

Thanks for your response.

I'm working in Scala. When I do a "count" call, Spark expands the RDD as part of the count (since it's an action). You can see the call stack that results in the failure of the job here:

ERROR DiskBlockObjectWriter - Uncaught exception while reverting partial writes to file /tmp/spark-local-20141216170458-964a/1d/temp_shuffle_4f46af09-5521-4fc6-adb1-c72839520560
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:86)
    at org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:221)
    at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:86)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:263)
    at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
    at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:718)
    at org.apache.spark.serializer.JavaSerializationStream.flush(JavaSerializer.scala:51)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:173)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$stop$2.apply(ExternalSorter.scala:774)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$stop$2.apply(ExternalSorter.scala:773)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.util.collection.ExternalSorter.stop(ExternalSorter.scala:773)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:93)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Notice that the task run (this is now doing the "count") results in a shuffle, during which it writes the intermediate RDD to disk (and fails when the disk is full). This intermediate RDD/disk write is unnecessary.
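To illustrate why I say it's unnecessary: the same count can be computed by streaming the decompressed rows one at a time, with nothing staged on disk. A minimal sketch in plain Scala outside of Spark (the file path is just a placeholder):

    import java.io.FileInputStream
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // Decompress and count row by row; no intermediate collection or
    // temp files are ever created.
    val lineCount = Source
      .fromInputStream(new GZIPInputStream(new FileInputStream("/path/to/file.gz")))
      .getLines()
      .size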

I even implemented a "Seq[String]" in terms of streaming the file and called sc.parallelize(mySequence, 1), and THIS results in a call to "toArray" on my sequence. Since the expanded data won't fit on disk, it certainly won't fit in an array in memory.
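For reference, that attempt looked roughly like this (a simplified sketch; the path and names are placeholders, not the actual code):

    import java.io.FileInputStream
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // A lazy Seq[String] backed by streaming the gzipped file, so rows are
    // only produced as they're consumed.
    val mySequence: Seq[String] = Source
      .fromInputStream(new GZIPInputStream(new FileInputStream("/path/to/file.gz")))
      .getLines()
      .toStream

    // sc is the SparkContext. parallelize still materializes the entire
    // sequence (the "toArray" call mentioned above), so this fails for
    // data larger than memory.
    val rdd = sc.parallelize(mySequence, 1)
    rdd.count()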

Thanks for taking the time to respond.

Jim

On 12/16/2014 04:57 PM, Harry Brundage wrote:
Are you certain that's happening, Jim? Why? What happens if you just do sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat for gzip and the RDD wrapper around it already have the "streaming" behaviour you wish for, but I could be wrong. Also, are you in pyspark or Scala Spark?

On Tue, Dec 16, 2014 at 1:22 PM, Jim Carroll <jimfcarr...@gmail.com> wrote:

    Is there a way to get Spark to NOT repartition/shuffle/expand a
    sc.textFile(fileUri) when the URI is a gzipped file?

    Expanding a gzipped file should be thought of as a "transformation" and
    not an "action" (if the analogy is apt). There is no need to fully create
    and fill out an intermediate RDD with the expanded data when it can be
    done one row at a time.





