[jira] [Commented] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2018-06-21 Thread Justin Miller (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519568#comment-16519568
 ] 

Justin Miller commented on SPARK-17636:
---

Any updates on this ticket? Thanks!

> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* entry for a simple numeric field, but not for a 
> numeric field inside a struct. Not sure whether this is a limitation imposed 
> by Parquet or one in Spark itself.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct<...>, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}
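The plans above show the asymmetry directly: the comparison on the top-level sale_id column appears under PushedFilters, while the comparison on day_timestamp.timestamp does not. A minimal sketch of the eligibility rule as observed (illustrative only, not Spark's actual pushdown code; Predicate and pushable are invented names):

```scala
// Sketch: in the Spark versions reported here, only predicates on top-level
// columns were translated into Parquet PushedFilters; a reference to a field
// inside a struct (written with a dot) was left to be evaluated after the scan.
case class Predicate(column: String, op: String, value: Long)

// Illustrative rule: a dotted column name means nested struct access.
def pushable(p: Predicate): Boolean = !p.column.contains(".")

val predicates = Seq(
  Predicate("sale_id", ">", 4),                 // top-level column: pushed
  Predicate("day_timestamp.timestamp", ">", 4)  // struct field: not pushed
)

val (pushed, residual) = predicates.partition(pushable)
println(s"PushedFilters: $pushed")
println(s"Evaluated post-scan: $residual")
```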



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-01-24 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338032#comment-16338032
 ] 

Justin Miller commented on SPARK-17147:
---

I'm also seeing this behavior on a topic that has cleanup.policy=delete. The 
volume on this topic is very large, > 10 billion messages per day, and it seems 
to happen about once per day. Another topic, with lower volume but larger 
messages, hits it every few days.

18/01/23 18:30:10 WARN TaskSetManager: Lost task 28.0 in stage 26.0 (TID 861, 
,executor 15): java.lang.AssertionError: assertion failed: Got wrong record 
for -8002  124 even after seeking to offset 1769485661 
got back record.offset 1769485662
18/01/23 18:30:12 INFO TaskSetManager: Lost task 28.1 in stage 26.0 (TID 865) 
on ,executor 24: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 1]
18/01/23 18:30:14 INFO TaskSetManager: Lost task 28.2 in stage 26.0 (TID 866) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 2]
18/01/23 18:30:15 INFO TaskSetManager: Lost task 28.3 in stage 26.0 (TID 867) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 3]
18/01/23 18:30:18 WARN TaskSetManager: Lost task 28.0 in stage 27.0 (TID 898, 
,executor 6): java.lang.AssertionError: assertion failed: Got wrong record 
for -8002  124 even after seeking to offset 1769485661 
got back record.offset 1769485662
18/01/23 18:30:19 INFO TaskSetManager: Lost task 28.1 in stage 27.0 (TID 900) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 1]
18/01/23 18:30:20 INFO TaskSetManager: Lost task 28.2 in stage 27.0 (TID 901) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 2]
18/01/23 18:30:21 INFO TaskSetManager: Lost task 28.3 in stage 27.0 (TID 902) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 3]

When checked with kafka-simple-consumer-shell, the offset is in fact missing:

next offset = 1769485661
next offset = 1769485663
next offset = 1769485664
next offset = 1769485665

I'm testing out this branch in the persister and will post if it crashes again 
over the next few days (I have the kafka-10 source from the branch deployed 
with a few extra log lines). We're now on log format 0.10.2 (upgraded 
yesterday) but saw the same issue on 0.9.0.0.

chao.wu - Is this behavior similar to what you're seeing?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next 
> offset will always be exactly 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
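The workaround above can be sketched without Kafka or Spark: simulate a compacted log with missing offsets and derive the next expected offset from the record actually returned rather than by incrementing a counter. This is an illustrative model only; Record and consume are invented names, not the actual CachedKafkaConsumer internals:

```scala
// A record as returned by a fetch; in a compacted topic, some offsets are gone.
case class Record(offset: Long, value: String)

// Compacted log: offsets 7 and 8 were removed by compaction.
val log = Seq(Record(5, "a"), Record(6, "b"), Record(9, "c"), Record(10, "d"))

// Gap-tolerant advance: set nextOffset from the record we actually received
// (record.offset + 1), never by assuming the stream is consecutive.
def consume(log: Seq[Record], from: Long, until: Long): (Seq[Record], Long) = {
  var nextOffset = from
  val out = log.filter(r => r.offset >= from && r.offset < until)
  for (r <- out) nextOffset = r.offset + 1
  (out, nextOffset)
}

val (records, next) = consume(log, 5, 11)
println(records.map(_.offset)) // the gap at 7 and 8 is simply skipped
```

An offset-equality assertion (the "Got wrong record ... even after seeking" failure quoted above) would trip at the first gap; advancing from the returned record's offset sidesteps it.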






[jira] [Created] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly

2017-03-09 Thread Justin Miller (JIRA)
Justin Miller created SPARK-19888:
-

 Summary: Seeing offsets not resetting even when reset policy is 
configured explicitly
 Key: SPARK-19888
 URL: https://issues.apache.org/jira/browse/SPARK-19888
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Justin Miller


I was told to post this in a Spark ticket from KAFKA-4396:

I've been seeing a curious error with the kafka 0.10 consumer (spark 2.11); 
these may be two separate errors, I'm not sure. What's puzzling is that I'm 
setting auto.offset.reset to latest and it's still throwing an 
OffsetOutOfRangeException, behavior that's contrary to the code. Please help! :)

{code}
val kafkaParams = Map[String, Object](
  "group.id" -> consumerGroup,
  "bootstrap.servers" -> bootstrapServers,
  "key.deserializer" -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[MessageRowDeserializer],
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "max.poll.records" -> persisterConfig.maxPollRecords,
  "request.timeout.ms" -> persisterConfig.requestTimeoutMs,
  "session.timeout.ms" -> persisterConfig.sessionTimeoutMs,
  "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs,
  "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs
)
{code}

{code}
16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory 
on xyz (size: 146.3 KB, free: 8.4 GB)
16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID 
38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
Offsets out of range with no configured reset policy for partitions: 
{topic=231884473}
at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
at 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID 
39388) in 12043 ms on xyz (1/16)
16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID 
39375) in 13444 ms on xyz (2/16)
16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID 38843, 
xyz): java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
multi-threaded access
at 
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2016-10-18 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586422#comment-15586422
 ] 

Justin Miller commented on SPARK-17147:
---

OK, I'll try to assemble everything I've seen so far around it and post it to 
the list. Thanks!

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>






[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2016-10-18 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586403#comment-15586403
 ] 

Justin Miller commented on SPARK-17147:
---

OK thank you. Might be related to the thousands of partitions we have spread 
across hundreds of topics. "Oops".

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>






[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-18 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586387#comment-15586387
 ] 

Justin Miller commented on SPARK-17147:
---

Log compaction? Only for our offset topics.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>






[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-17 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584169#comment-15584169
 ] 

Justin Miller commented on SPARK-17147:
---

Could this possibly be related to why I'm seeing the following?

16/10/18 02:11:02 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 5823, 
ip-172-20-222-162.int.protectwise.net): java.lang.IllegalStateException: This 
consumer has already been closed.
at 
org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>






[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575945#comment-15575945
 ] 

Justin Miller commented on SPARK-17936:
---

Hey Sean,

I did a bit more digging this morning looking at SpecificUnsafeProjection and 
saw this commit: 
https://github.com/apache/spark/commit/b1b47274bfeba17a9e4e9acebd7385289f31f6c8

I thought I'd try running with 2.1.0-SNAPSHOT, and it appears to work great now!

[Stage 1:> (0 + 8) / 8]11:28:33.237 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 0 / Thrift Parse Errors: 0
[Stage 3:> (0 + 8) / 8]11:29:03.236 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 89 / Thrift Parse Errors: 0
[Stage 5:> (4 + 4) / 8]11:29:33.237 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 205 / Thrift Parse Errors: 0

Since we're still testing this out, the snapshot works fine for now. Do you 
know when 2.1.0 might be generally available?

Best,
Justin


> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin
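The Success/Failure branching in the write snippet can be exercised without Spark. A minimal sketch, with plain local counters standing in for whatever metrics the original thriftParseSuccess/thriftParseErrors actually are (likely accumulators), and toInt standing in for Thrift deserialization:

```scala
import scala.util.{Try, Success, Failure}

// Local counters; in the original snippet these are presumably Spark
// accumulators or similar metrics.
var thriftParseSuccess = 0
var thriftParseErrors = 0

// Stand-in for the Thrift deserializer: wraps a fallible parse in Try.
def parse(s: String): Try[Int] = Try(s.toInt)

// Same flatMap pattern as the quoted code: keep successfully parsed rows,
// count failures, and drop them from the stream.
val rows = Seq("1", "2", "oops", "4").map(parse).flatMap {
  case Success(row) =>
    thriftParseSuccess += 1
    Some(row)
  case Failure(ex) =>
    thriftParseErrors += 1
    None
}
```

Note the flatMap itself is unlikely to be the trigger here; the 64 KB codegen limit is hit when compiling the generated projection for a very wide or deeply nested schema at write time.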






[jira] [Reopened] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Miller reopened SPARK-17936:
---

I don't believe this is a duplicate. This occurs while trying to write to 
parquet (with very little data) and happens almost immediately after 
batchDuration.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>






[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575494#comment-15575494
 ] 

Justin Miller commented on SPARK-17936:
---

It's just strange that the issue seems to affect previous versions (if these 
are the same issue) but didn't impact me when I was using 1.6.2 and the 0.8 
kafka consumer. Is it possible that Scala 2.10 vs. Scala 2.11 makes a 
difference? There are a lot of variables at play, unfortunately.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>






[jira] [Issue Comment Deleted] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Miller updated SPARK-17936:
--
Comment: was deleted

(was: I did look through them and I don't think they're related. Note that the 
error is different and this is trying to write data not read large amounts of 
data.)

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>






[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575471#comment-15575471
 ] 

Justin Miller commented on SPARK-17936:
---

I'd also note this wasn't an issue in Spark 1.6.2. The process would run fine 
for hours without ever hitting this error.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>






[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575452#comment-15575452
 ] 

Justin Miller commented on SPARK-17936:
---

I did look through them and I don't think they're related. Note that the error 
is different, and this one occurs while writing data, not while reading large 
amounts of data.







[jira] [Created] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)
Justin Miller created SPARK-17936:
-

 Summary: "CodeGenerator - failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of" method Error
 Key: SPARK-17936
 URL: https://issues.apache.org/jira/browse/SPARK-17936
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Justin Miller




