[jira] [Commented] (SPARK-27267) Snappy 1.1.7.1 fails when decompressing empty serialized data
[ https://issues.apache.org/jira/browse/SPARK-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805319#comment-16805319 ] Taro L. Saito commented on SPARK-27267: --- [~srowen] Actually this Java 9 support is for SnappyFramedStream, which is not used in Spark. I'm now testing snappy-java with JDK 11: [https://github.com/xerial/snappy-java/pull/230] > Snappy 1.1.7.1 fails when decompressing empty serialized data > - > > Key: SPARK-27267 > URL: https://issues.apache.org/jira/browse/SPARK-27267 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core > Affects Versions: 2.4.0 > Environment: spark.rdd.compress=true > spark.io.compression.codec=snappy > Spark 2.4 on Hadoop 2.6 with Hive > Reporter: Max Xie > Assignee: Max Xie > Priority: Minor > > I use PySpark like this: > ``` > from pyspark.storagelevel import StorageLevel > df=spark.sql("select * from xzn.person") > df.persist(StorageLevel(False, True, False, False)) > df.count() > ``` > Table person is a simple table stored as ORC files, and some ORC files are > empty. When I run the query, it throws this error: > ``` > 19/03/22 21:46:31 INFO MemoryStore:54 - Block rdd_2_1 stored as values in > memory (estimated size 0.0 B, free 1662.6 MB) > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00011, range: 0-49, partition values: [empty > row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00011_copy_1, range: 0-49, partition values: > [empty row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00012, range: 0-49, partition values: [empty > row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00012_copy_1, range: 0-49, partition values: > [empty row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00013, range: 0-49, partition values: [empty > row] > 19/03/22 21:46:31 ERROR Executor:91 - Exception in task 1.0 in stage 0.0 (TID > 1) > org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty > stream > at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:94) > at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:59) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:164) > at > org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163) > at > org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:596) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:886) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ``` > After investigating, I found that snappy-java 1.1.7.x behaves differently > from 1.1.2.x (the version Spark 2.0.2 uses). SnappyOutputStream in 1.1.2.x > always writes a Snappy header whether or not a value is written, but > SnappyOutputStream in 1.1.7.x does not generate a header if no value is > written into it, so in Spark 2.4, if an RDD caches an empty value, MemoryStore will not cache any bytes (no
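A minimal sketch of the behavior described above, assuming snappy-java 1.1.7.1 on the classpath (the Scala harness is illustrative, not code from the ticket):
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

object EmptyStreamRepro extends App {
  // Under 1.1.7.x, closing a SnappyOutputStream without writing anything
  // emits zero bytes, whereas 1.1.2.x always emitted a stream header.
  val sink = new ByteArrayOutputStream()
  new SnappyOutputStream(sink).close()
  println(s"compressed size: ${sink.size()}") // 0 under 1.1.7.1

  // Reading those zero bytes back fails in the constructor's header check,
  // which is what Spark hits via SerializerManager.wrapForCompression:
  // org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty stream
  new SnappyInputStream(new ByteArrayInputStream(sink.toByteArray))
}
{code}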
[jira] [Commented] (SPARK-27267) Spark 2.4 uses 1.1.7.x snappy-java, but its behavior differs from 1.1.2.x
[ https://issues.apache.org/jira/browse/SPARK-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801366#comment-16801366 ] Taro L. Saito commented on SPARK-27267: --- Just released snappy-java 1.1.7.3 with this hot fix: https://github.com/xerial/snappy-java > Spark 2.4 uses 1.1.7.x snappy-java, but its behavior differs from > 1.1.2.x > > > Key: SPARK-27267 > URL: https://issues.apache.org/jira/browse/SPARK-27267 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core > Affects Versions: 2.4.0 > Environment: spark.rdd.compress=true > spark.io.compression.codec=snappy > Spark 2.4 on Hadoop 2.6 with Hive > Reporter: Max Xie > Priority: Minor > > I use PySpark like this: > ``` > from pyspark.storagelevel import StorageLevel > df=spark.sql("select * from xzn.person") > df.persist(StorageLevel(False, True, False, False)) > df.count() > ``` > Table person is a simple table stored as ORC files, and some ORC files are > empty. When I run the query, it throws this error: > ``` > 19/03/22 21:46:31 INFO MemoryStore:54 - Block rdd_2_1 stored as values in > memory (estimated size 0.0 B, free 1662.6 MB) > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00011, range: 0-49, partition values: [empty > row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00011_copy_1, range: 0-49, partition values: > [empty row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00012, range: 0-49, partition values: [empty > row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00012_copy_1, range: 0-49, partition values: > [empty row] > 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: > viewfs://name/xzn.db/person/part-00013, range: 0-49, partition values: [empty > row] > 19/03/22 21:46:31 ERROR Executor:91 - Exception in task 1.0 in stage 0.0 (TID > 1) > org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty > stream > at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:94) > at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:59) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:164) > at > org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163) > at > org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:596) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:886) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ``` > After investigating, I found that snappy-java 1.1.7.x behaves differently > from 1.1.2.x (the version Spark 2.0.2 uses). SnappyOutputStream in 1.1.2.x > always writes a Snappy header whether or not a value is written, but > SnappyOutputStream in 1.1.7.x does not generate a header if no value is > written into it, so in Spark 2.4, if an RDD caches an empty value, MemoryStore > will not cache any bytes (no
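Until the fixed snappy-java is picked up, one conceivable guard is to skip the Snappy wrapper for zero-length blocks. This sketch is hypothetical; it is not the actual Spark or snappy-java fix:
{code}
import java.io.{ByteArrayInputStream, InputStream}
import org.xerial.snappy.SnappyInputStream

object SnappyGuard {
  // Hypothetical helper: an empty cached block deserializes to an empty
  // stream instead of tripping the [EMPTY_INPUT] header check.
  def wrapIfNonEmpty(bytes: Array[Byte]): InputStream = {
    val raw = new ByteArrayInputStream(bytes)
    if (bytes.isEmpty) raw else new SnappyInputStream(raw)
  }
}
{code}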
[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649090#comment-15649090 ] Taro L. Saito commented on SPARK-14540: --- I'm also hitting a similar problem in my dependency injection library for Scala: https://github.com/wvlet/airframe/pull/39. I feel ClosureCleaner-like functionality is necessary in Scala 2.12 itself. > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core > Reporter: Josh Rosen > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info] - element of array (index: 0) > [info] - array (class "[Ljava.lang.Object;", size: 1) > [info] - field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info] - object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info] - element of array (index: 0) > [info] - array (class "[Ljava.lang.Object;", size: 1) > [info] - field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info] - object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info] - field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info] - object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info] - field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info] - root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas.
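A small sketch of why 2.12 closures need cleaning, using plain Java serialization (the class names here are illustrative): a lambda that reads a field captures its enclosing instance, and that instance lands in SerializedLambda's capturedArgs, matching the failure above.
{code}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Driver { // not Serializable, playing the role of java.lang.Object above
  val factor = 2
  // In Scala 2.12 this compiles to a Java 8 lambda; reading `factor` makes it
  // capture `this`, so `this` ends up in SerializedLambda.capturedArgs.
  val f: Int => Int = x => x * factor
}

object CaptureDemo extends App {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try out.writeObject(new Driver().f)
  catch { case e: NotSerializableException => println(s"not cleaned: $e") }
}
{code}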
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182472#comment-15182472 ] Taro L. Saito edited comment on SPARK-5928 at 3/7/16 2:24 AM: -- FYI, I created the LArray library, which can handle data larger than 2GB, the limit of Java byte arrays and mmap files: https://github.com/xerial/larray It looks like there are several reports showing when this 2GB limit can be problematic (especially in processing Spark SQL): http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29 Let me know if there is anything I can work on. was (Author: taroleo): FYI. I created LArray library that can handle data larger than 2GB, which is the limit of Java byte arrays and mmap files: https://github.com/xerial/larray It looks like there are several reports when this 2GB limit can be problematic (especially in processing Spark SQL): http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29 Let me know if there is anything that I can work on. > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core > Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > // need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode; it only happens on > remote fetches. 
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at >
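For context, the 2 GB ceiling falls out of Int indexing; a minimal sketch using the frame size from the log above:
{code}
object FrameLimit extends App {
  // Java byte arrays, ByteBuffers, and Netty frames are indexed by Int,
  // so a single buffer tops out at Int.MaxValue bytes.
  val maxFrame: Long = Int.MaxValue // 2147483647, roughly 2 GB
  val blockSize: Long = 3021252889L // the rejected frame size from the log above
  println(s"over the limit: ${blockSize > maxFrame}") // true, so the frame is discarded
  // new Array[Byte](blockSize) // would not even compile: an Array length must be an Int
}
{code}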
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182472#comment-15182472 ] Taro L. Saito commented on SPARK-5928: -- FYI, I created the LArray library, which can handle data larger than 2GB, the limit of Java byte arrays and mmap files: https://github.com/xerial/larray It looks like there are several reports where this 2GB limit can be problematic (especially in processing Spark SQL): http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska/29 Let me know if there is anything I can work on. > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core > Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > // need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode; it only happens on > remote fetches. I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - 
discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at >
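A common workaround pattern, sketched here with a hypothetical helper rather than Spark code, is to move such a block as several sub-2 GB chunks instead of one frame:
{code}
object ChunkedFetch extends App {
  // Hypothetical chunking helper: split a logical block of totalBytes into
  // (offset, length) pieces that each fit comfortably in one frame.
  val chunkSize = 64 * 1024 * 1024 // 64 MB per chunk, well under Int.MaxValue
  def chunkOffsets(totalBytes: Long): Iterator[(Long, Int)] =
    Iterator.iterate(0L)(_ + chunkSize)
      .takeWhile(_ < totalBytes)
      .map(off => (off, math.min(chunkSize.toLong, totalBytes - off).toInt))

  println(chunkOffsets(3021252889L).size) // 46 chunks for the block from the log
}
{code}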
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497899#comment-14497899 ] Taro L. Saito commented on SPARK-3937: -- I think this error occurs when a wrong memory position or corrupted data is read by Snappy. I would like to check the binary data read by SnappyInputStream. Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough that I figured I should post it. {code} java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code}
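snappy-java exposes a validity check that can be run on suspect raw-compressed bytes before they reach the native uncompressor, where corrupted input can surface as the unsafe-memory-access fault above. A minimal sketch (the wrapper object is illustrative):
{code}
import org.xerial.snappy.Snappy

object SafeUncompress {
  // Reject corrupted or misaligned data up front instead of risking a fault
  // inside SnappyNative.rawUncompress.
  def apply(input: Array[Byte]): Array[Byte] = {
    require(Snappy.isValidCompressedBuffer(input), "corrupted or misaligned Snappy data")
    Snappy.uncompress(input)
  }
}
{code}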
[jira] [Commented] (SPARK-2881) Snappy is now default codec - could lead to conflicts since uses /tmp
[ https://issues.apache.org/jira/browse/SPARK-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099870#comment-14099870 ] Taro L. Saito commented on SPARK-2881: -- This problem is fixed in snappy-java 1.1.1.3. But for your convenience, I applied a hot fix and released 1.0.5.3: https://github.com/xerial/snappy-java/commit/89277ddb7a9982126d444af3a290a1d68953ac66 It will be available soon in Maven Central: http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ Snappy is now default codec - could lead to conflicts since uses /tmp - Key: SPARK-2881 URL: https://issues.apache.org/jira/browse/SPARK-2881 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Patrick Wendell Priority: Blocker I was using the Spark master branch and ran into an issue with Snappy since it's now the default codec for shuffle. The issue was that someone else had run with Snappy and it created /tmp/snappy-*.so, but it had restrictive permissions, so I was not able to use it or remove it. This caused my Spark job to not start. I was running in YARN client mode at the time. YARN cluster mode shouldn't have this issue since we change java.io.tmpdir. I assume this would also affect standalone mode. I'm not sure if this is a true blocker but wanted to file it as one at first and let us decide.
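For reference, the configuration knob involved: snappy-java extracts its native library under java.io.tmpdir by default, and the org.xerial.snappy.tempdir system property redirects the extraction to a directory the current user owns. A minimal sketch (the path is illustrative):
{code}
object SnappyTempDir extends App {
  // Must run before any Snappy class first loads the native library.
  System.setProperty("org.xerial.snappy.tempdir",
    s"/tmp/snappy-${sys.props("user.name")}")
}
{code}
The same property can be passed on the command line as -Dorg.xerial.snappy.tempdir=/path/to/dir.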
[jira] [Comment Edited] (SPARK-2881) Snappy is now default codec - could lead to conflicts since uses /tmp
[ https://issues.apache.org/jira/browse/SPARK-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099870#comment-14099870 ] Taro L. Saito edited comment on SPARK-2881 at 8/17/14 3:31 AM: --- This problem is fixed in snappy-java 1.1.1.3. But for your convenience, I applied a hot fix and released 1.0.5.3: https://github.com/xerial/snappy-java/commit/89277ddb7a9982126d444af3a290a1d68953ac66 This version 1.0.5.3 is now available in Maven Central: http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ was (Author: taroleo): This problem is fixed in snappy-java 1.1.1.3. But for your convenience, I applied a hot fix and released 1.0.5.3: https://github.com/xerial/snappy-java/commit/89277ddb7a9982126d444af3a290a1d68953ac66 Which will be available soon in Maven central: http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ Snappy is now default codec - could lead to conflicts since uses /tmp - Key: SPARK-2881 URL: https://issues.apache.org/jira/browse/SPARK-2881 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Patrick Wendell Priority: Blocker I was using the Spark master branch and ran into an issue with Snappy since it's now the default codec for shuffle. The issue was that someone else had run with Snappy and it created /tmp/snappy-*.so, but it had restrictive permissions, so I was not able to use it or remove it. This caused my Spark job to not start. I was running in YARN client mode at the time. YARN cluster mode shouldn't have this issue since we change java.io.tmpdir. I assume this would also affect standalone mode. I'm not sure if this is a true blocker but wanted to file it as one at first and let us decide.
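For a Spark application, the same property could be pushed to the driver and executors through the extraJavaOptions settings; a hedged sketch (the directory name is made up, and in client mode the driver option must instead be passed via spark-submit, since the driver JVM is already running):
{code}
import org.apache.spark.SparkConf

object PerUserSnappyDir {
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions", "-Dorg.xerial.snappy.tempdir=/tmp/snappy-myuser")
    .set("spark.driver.extraJavaOptions", "-Dorg.xerial.snappy.tempdir=/tmp/snappy-myuser")
}
{code}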