[jira] [Commented] (SPARK-17636) Parquet filter push down doesn't handle struct fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519568#comment-16519568 ]

Justin Miller commented on SPARK-17636:
---------------------------------------

Any updates on this ticket? Thanks!

> Parquet filter push down doesn't handle struct fields
> -----------------------------------------------------
>
>                 Key: SPARK-17636
>                 URL: https://issues.apache.org/jira/browse/SPARK-17636
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.2, 1.6.3, 2.0.2
>            Reporter: Mitesh
>            Priority: Minor
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric
> field inside a struct. Not sure if this is a Spark limitation because of
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: struct, sale_id: bigint]
>
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
>
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: s3a://some/parquet/file
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
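The missing PushedFilters entry for the nested column can be reproduced locally with a small self-written Parquet file. A hedged sketch follows: the output path, column values, and `local[*]` session are stand-ins, and whether nested-field pushdown appears depends on the Spark version (support for it was only added in later releases):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pushdown-check") // placeholder app name
  .getOrCreate()
import spark.implicits._

// Placeholder data mimicking the reported schema: a struct column plus a flat bigint.
val df = Seq((1L, 100L), (2L, 200L))
  .toDF("timestamp", "sale_id")
  .selectExpr("named_struct('timestamp', timestamp) AS day_timestamp", "sale_id")
df.write.mode("overwrite").parquet("/tmp/pushdown-check")

val parquetDf = spark.read.parquet("/tmp/pushdown-check")

// Flat column: the scan node should list PushedFilters: [GreaterThan(sale_id,4)].
parquetDf.filter("sale_id > 4").explain()

// Nested struct field: on affected versions the scan shows no pushed filter,
// so every row is read and filtered post-scan.
parquetDf.filter("day_timestamp.timestamp > 4").explain()
```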
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338032#comment-16338032 ]

Justin Miller commented on SPARK-17147:
---------------------------------------

I'm also seeing this behavior on a topic that has cleanup.policy=delete. The volume on this topic is very large, > 10 billion messages per day, and it seems to happen about once per day. Another topic with lower volume but larger messages hits it every few days.

18/01/23 18:30:10 WARN TaskSetManager: Lost task 28.0 in stage 26.0 (TID 861, ,executor 15): java.lang.AssertionError: assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662
18/01/23 18:30:12 INFO TaskSetManager: Lost task 28.1 in stage 26.0 (TID 865) on ,executor 24: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 1]
18/01/23 18:30:14 INFO TaskSetManager: Lost task 28.2 in stage 26.0 (TID 866) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 2]
18/01/23 18:30:15 INFO TaskSetManager: Lost task 28.3 in stage 26.0 (TID 867) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 3]
18/01/23 18:30:18 WARN TaskSetManager: Lost task 28.0 in stage 27.0 (TID 898, ,executor 6): java.lang.AssertionError: assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662
18/01/23 18:30:19 INFO TaskSetManager: Lost task 28.1 in stage 27.0 (TID 900) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 1]
18/01/23 18:30:20 INFO TaskSetManager: Lost task 28.2 in stage 27.0 (TID 901) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 2]
18/01/23 18:30:21 INFO TaskSetManager: Lost task 28.3 in stage 27.0 (TID 902) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 3]

When checked with kafka-simple-consumer-shell, the offset is in fact missing:

next offset = 1769485661
next offset = 1769485663
next offset = 1769485664
next offset = 1769485665

I'm currently testing out this branch in the persister and will post if it crashes again over the next few days (I currently have the kafka-10 source from the branch with a few extra log lines deployed). We're currently on log format 0.10.2 (upgraded yesterday) but saw the same issue on 0.9.0.0.

chao.wu - Is this behavior similar to what you're seeing?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> (i.e. Log Compaction)
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17147
>                 URL: https://issues.apache.org/jira/browse/SPARK-17147
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.0
>            Reporter: Robert Conrad
>            Priority: Major
>
> When Kafka does log compaction offsets often end up with gaps, meaning the
> next requested offset will frequently not be offset+1. The logic in
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer.)
> There's a strong possibility that I have misconstrued how to use the
> streaming kafka consumer, and I'm happy to close this out if that's the case.
> If, however, it is supposed to support non-consecutive offsets (e.g. due to
> log compaction) I am also happy to contribute a PR.
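The workaround quoted above boils down to one rule: advance the cursor from the offset the broker actually returned instead of assuming offsets are consecutive. A minimal sketch of the two policies in plain Scala (no Kafka dependency; the names are illustrative, not the actual CachedKafkaConsumer fields):

```scala
// Offsets as they can look after log compaction: 1769485661 has been compacted away.
val brokerOffsets = Seq(1769485660L, 1769485662L, 1769485663L)

// Assumption baked into the affected code path: next offset == previous + 1.
// With a gap present, a strict check like the assertion in the logs above fails.
def nextAssumingConsecutive(requested: Long): Long = requested + 1

// Workaround from the ticket: derive the next request from the returned record.
def nextFromRecord(returnedOffset: Long): Long = returnedOffset + 1

// After fetching offset 1769485660, the broker's next record sits at 1769485662:
assert(nextAssumingConsecutive(1769485660L) == 1769485661L) // points at a missing offset
assert(nextFromRecord(1769485662L) == 1769485663L)          // skips cleanly past the gap
```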
[jira] [Created] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly
Justin Miller created SPARK-19888:
----------------------------------

             Summary: Seeing offsets not resetting even when reset policy is configured explicitly
                 Key: SPARK-19888
                 URL: https://issues.apache.org/jira/browse/SPARK-19888
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Justin Miller

I was told to post this in a Spark ticket from KAFKA-4396:

I've been seeing a curious error with kafka 0.10 (spark 2.11); these may be two separate errors, I'm not sure. What's puzzling is that I'm setting auto.offset.reset to latest and it's still throwing an OffsetOutOfRangeException, behavior that's contrary to the code. Please help! :)

{code}
val kafkaParams = Map[String, Object](
  "group.id" -> consumerGroup,
  "bootstrap.servers" -> bootstrapServers,
  "key.deserializer" -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[MessageRowDeserializer],
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "max.poll.records" -> persisterConfig.maxPollRecords,
  "request.timeout.ms" -> persisterConfig.requestTimeoutMs,
  "session.timeout.ms" -> persisterConfig.sessionTimeoutMs,
  "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs,
  "connections.max.idle.ms" -> persisterConfig.connectionsMaxIdleMs
)
{code}

{code}
16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory on xyz (size: 146.3 KB, free: 8.4 GB)
16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID 38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic=231884473}
	at org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
	at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID 39388) in 12043 ms on xyz (1/16)
16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID 39375) in 13444 ms on xyz (2/16)
16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID 38843, xyz): java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
	at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
	at
{code}
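The second stack trace above ("KafkaConsumer is not safe for multi-threaded access") typically means the per-executor cached consumer was reached from more than one task at once. One workaround suggested on related tickets is disabling the consumer cache; treat the configuration key in this sketch as an assumption to verify against your Spark version, since it was only added to spark-streaming-kafka-0-10 in later 2.x releases:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: disable the cached KafkaConsumer so each task builds its own
// consumer instead of sharing a non-thread-safe instance. Confirm that
// "spark.streaming.kafka.consumer.cache.enabled" exists in your Spark version
// before relying on it; expect extra consumer-creation overhead when disabled.
val conf = new SparkConf()
  .setAppName("kafka-persister") // placeholder app name
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
```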
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586422#comment-15586422 ]

Justin Miller commented on SPARK-17147:
---------------------------------------

K, I'll try to assemble everything I've seen so far around it and post it to the list. Thanks!
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586403#comment-15586403 ]

Justin Miller commented on SPARK-17147:
---------------------------------------

OK, thank you. Might be related to the thousands of partitions we have spread across hundreds of topics. "Oops".
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586387#comment-15586387 ]

Justin Miller commented on SPARK-17147:
---------------------------------------

Log compaction? Only for our offset topics.
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584169#comment-15584169 ]

Justin Miller commented on SPARK-17147:
---------------------------------------

Could this possibly be related to why I'm seeing the following?

16/10/18 02:11:02 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 5823, ip-172-20-222-162.int.protectwise.net): java.lang.IllegalStateException: This consumer has already been closed.
	at org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417)
	at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575945#comment-15575945 ]

Justin Miller commented on SPARK-17936:
---------------------------------------

Hey Sean,

I did a bit more digging this morning looking at SpecificUnsafeProjection and saw this commit: https://github.com/apache/spark/commit/b1b47274bfeba17a9e4e9acebd7385289f31f6c8

I thought I'd try running w/ 2.1.0-SNAPSHOT and see how things went, and it appears to work great now!

[Stage 1:> (0 + 8) / 8]11:28:33.237 INFO c.p.o.ObservationPersister - (ObservationPersister) - Thrift Parse Success: 0 / Thrift Parse Errors: 0
[Stage 3:> (0 + 8) / 8]11:29:03.236 INFO c.p.o.ObservationPersister - (ObservationPersister) - Thrift Parse Success: 89 / Thrift Parse Errors: 0
[Stage 5:> (4 + 4) / 8]11:29:33.237 INFO c.p.o.ObservationPersister - (ObservationPersister) - Thrift Parse Success: 205 / Thrift Parse Errors: 0

Since we're still testing this out, that snapshot works great for now. Do you know when 2.1.0 might be available generally?

Best,
Justin

> "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17936
>                 URL: https://issues.apache.org/jira/browse/SPARK-17936
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift
> structs coming from Kafka into Parquet files stored in S3. This conversion
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
> 	at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> 	at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
> 	at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
> 	at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I
> have a hunch that this is happening at write time as the error occurs after
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
>     case Success(row) =>
>       thriftParseSuccess += 1
>       Some(row)
>     case Failure(ex) =>
>       thriftParseErrors += 1
>       logger.error("Error during deserialization: ", ex)
>       None
>   }.foreachRDD { rdd =>
>     val sqlContext = SQLContext.getOrCreate(rdd.context)
>     transformer(sqlContext.createDataFrame(rdd, converter.schema))
>       .coalesce(coalesceSize)
>       .write
>       .mode(Append)
>       .partitionBy(partitioning: _*)
>       .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can
> do to help.
> Best,
> Justin
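Before the fix in the linked commit shipped in 2.1.0, a workaround often suggested for "grows beyond 64 KB" codegen failures was to disable whole-stage code generation so Spark falls back to interpreted evaluation. Whether that helps here depends on which generated class overflows (the failure above is in SpecificUnsafeProjection), so treat the following as a diagnostic sketch rather than a fix, and expect a performance cost:

```scala
import org.apache.spark.sql.SparkSession

// Hedged workaround sketch: turning off whole-stage codegen avoids fusing many
// operators into one giant generated method, a common way to hit the JVM's
// 64 KB method-size limit. Use only for diagnosis or as a stopgap.
val spark = SparkSession.builder()
  .appName("thrift-to-parquet") // placeholder app name
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()
```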
[jira] [Reopened] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Miller reopened SPARK-17936:
-----------------------------------

I don't believe this is a duplicate. This occurs while trying to write to Parquet (with very little data) and happens almost immediately after batchDuration.
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575494#comment-15575494 ]

Justin Miller commented on SPARK-17936:
---------------------------------------

It's just strange that the issue seems to affect previous versions (if they're the same issue) but didn't impact me when I was using 1.6.2 and the 0.8 kafka consumer. Is it possible that Scala 2.10 vs Scala 2.11 makes a difference? There are a lot of variables at play, unfortunately.
[jira] [Issue Comment Deleted] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Miller updated SPARK-17936:
----------------------------------
    Comment: was deleted

(was: I did look through them and I don't think they're related. Note that the error is different and this is trying to write data, not read large amounts of data.)
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575471#comment-15575471 ]

Justin Miller commented on SPARK-17936:
---------------------------------------

I'd also note that this wasn't an issue in Spark 1.6.2. The process would run fine for hours and never crashed on this error.

> "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17936
>                 URL: https://issues.apache.org/jira/browse/SPARK-17936
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Justin Miller
>
> Greetings. I'm currently migrating a project from Spark 1.6.2 to 2.0.1. The
> project uses Spark Streaming to convert Thrift structs coming from Kafka into
> Parquet files stored in S3. This conversion works fine in 1.6.2, but I think
> there may be a bug in 2.0.1. Stack trace below:
>
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
>     at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>     at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>     at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>     at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
>
> Also, later on:
>
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
>
> I've seen similar issues posted, but those were always on the query side. I
> have a hunch that this is happening at write time, as the error occurs after
> batchDuration. Here's the write snippet:
>
> stream.
>   flatMap {
>     case Success(row) =>
>       thriftParseSuccess += 1
>       Some(row)
>     case Failure(ex) =>
>       thriftParseErrors += 1
>       logger.error("Error during deserialization: ", ex)
>       None
>   }.foreachRDD { rdd =>
>     val sqlContext = SQLContext.getOrCreate(rdd.context)
>     transformer(sqlContext.createDataFrame(rdd, converter.schema))
>       .coalesce(coalesceSize)
>       .write
>       .mode(Append)
>       .partitionBy(partitioning: _*)
>       .parquet(parquetPath)
>   }
>
> Please let me know if you can be of assistance and if there's anything I can
> do to help.
>
> Best,
> Justin

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
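Not from the ticket itself, but a possible mitigation sketch while this is investigated: Spark 2.0.x exposes the conf key `spark.sql.codegen.wholeStage`, and turning it off makes Spark fall back to non-fused evaluation for the affected plans. Whether that avoids this particular `SpecificUnsafeProjection` compile for a very wide Thrift schema is an assumption to verify, not a confirmed fix.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hedged workaround sketch, NOT a confirmed fix for SPARK-17936:
// disable whole-stage code generation so Spark 2.0 falls back to
// non-fused evaluation. The conf key is real in 2.0.x; whether it
// avoids the 64 KB SpecificUnsafeProjection method here is unverified.
def withoutWholeStageCodegen(sc: SparkContext): SQLContext = {
  val sqlContext = SQLContext.getOrCreate(sc)
  sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
  sqlContext
}
```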
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575452#comment-15575452 ]

Justin Miller commented on SPARK-17936:
---------------------------------------

I did look through them, and I don't think they're related. Note that the error is different, and this comes up when writing data, not when reading large amounts of data.
[jira] [Created] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
Justin Miller created SPARK-17936:
----------------------------------

             Summary: "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
                 Key: SPARK-17936
                 URL: https://issues.apache.org/jira/browse/SPARK-17936
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.1
            Reporter: Justin Miller
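As a side note for readers, the parse-and-filter idiom in the write snippet quoted in this thread can be reproduced standalone. The record type and parser below are made up for illustration, standing in for the Thrift deserializer and the Kafka DStream:

```scala
import scala.util.{Failure, Success, Try}

// Standalone sketch of the flatMap-over-Try idiom from the snippet above,
// using a plain List where the job uses a Kafka DStream. Record and parse
// are hypothetical; they are not from the ticket.
final case class Record(id: Int)

def parse(s: String): Try[Record] = Try(Record(s.trim.toInt))

val raw = List("1", "2", "oops", "4")

val rows = raw.map(parse).flatMap {
  case Success(r)  => Some(r) // keep good records (the real job also bumps a counter)
  case Failure(ex) => None    // drop bad ones (the real job logs and counts them)
}
// rows == List(Record(1), Record(2), Record(4))
```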