[jira] [Updated] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty
[ https://issues.apache.org/jira/browse/SPARK-18787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18787: - Description: The documentation for the configuration spark.shuffle.io.preferDirectBufs states that it will force all allocations from Netty to be on-heap, but this currently does not happen. The reason is that the preferDirect parameter of Netty's PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. In order to completely stop Netty from using off-heap memory, we need to set the following system properties: - io.netty.noUnsafe=true - io.netty.threadLocalDirectBufferSize=0 The proposal is to set these properties (using System.setProperty) when the executor starts (before any of the Netty classes load) or to document them to hint users on how to completely eliminate Netty's off-heap footprint. was: The documentation for the configuration spark.shuffle.io.preferDirectBufs states that it will force all allocations from Netty to be on-heap but this currently does not happen. The reason is that preferDirect of Netty's PooledByteBufAllocator doesn't completely eliminate use of off heap by Netty. In order to completely stop netty from using off heap memory, we need to set the following system properties: - io.netty.noUnsafe=true - io.netty.threadLocalDirectBufferSize=0 The proposal is to set properties (using System.setProperties) when the executor starts (before any of the Netty classes load) or document these properties to hint users on how to completely eliminate Netty' off heap footprint. > spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer > usage by Netty > --- > > Key: SPARK-18787 > URL: https://issues.apache.org/jira/browse/SPARK-18787 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.2 > Reporter: Aniket Bhatnagar > > The documentation for the configuration spark.shuffle.io.preferDirectBufs > states that it will force all allocations from Netty to be on-heap, but this > currently does not happen. The reason is that the preferDirect parameter of > Netty's PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap > memory. In order to completely stop Netty from using off-heap memory, we > need to set the following system properties: > - io.netty.noUnsafe=true > - io.netty.threadLocalDirectBufferSize=0 > The proposal is to set these properties (using System.setProperty) when the > executor starts (before any of the Netty classes load) or to document them > to hint users on how to completely eliminate Netty's off-heap > footprint.
[jira] [Updated] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty
[ https://issues.apache.org/jira/browse/SPARK-18787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18787: - Description: The documentation for the configuration spark.shuffle.io.preferDirectBufs states that it will force all allocations from Netty to be on-heap, but this currently does not happen. The reason is that preferDirect of Netty's PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. In order to completely stop Netty from using off-heap memory, we need to set the following system properties: - io.netty.noUnsafe=true - io.netty.threadLocalDirectBufferSize=0 The proposal is to set these properties (using System.setProperty) when the executor starts (before any of the Netty classes load) or to document them to hint users on how to completely eliminate Netty's off-heap footprint. was: The documentation for the configuration spark.shuffle.io.preferDirectBufs states that it will force all allocations from Netty to be on-heap but this currently does not happen. The reason is that preferDirect of Netty's PooledByteBufAllocator doesn't completely eliminate use of off heap by Netty. In order to completely stop netty from using off heap memory, we need to set the following system properties: - io.netty.noUnsafe=true - io.netty.threadLocalDirectBufferSize=0 The proposal is to set properties (using System.setProperties) when the executor starts (before any of the Netty classes load) or document these properties to hint users on how to completely eliminate off Netty' heap footprint. > spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer > usage by Netty > --- > > Key: SPARK-18787 > URL: https://issues.apache.org/jira/browse/SPARK-18787 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.2 > Reporter: Aniket Bhatnagar > > The documentation for the configuration spark.shuffle.io.preferDirectBufs > states that it will force all allocations from Netty to be on-heap, but this > currently does not happen. The reason is that preferDirect of Netty's > PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. > In order to completely stop Netty from using off-heap memory, we need to set > the following system properties: > - io.netty.noUnsafe=true > - io.netty.threadLocalDirectBufferSize=0 > The proposal is to set these properties (using System.setProperty) when the > executor starts (before any of the Netty classes load) or to document them > to hint users on how to completely eliminate Netty's off-heap > footprint.
[jira] [Created] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty
Aniket Bhatnagar created SPARK-18787: Summary: spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty Key: SPARK-18787 URL: https://issues.apache.org/jira/browse/SPARK-18787 Project: Spark Issue Type: Bug Affects Versions: 2.0.2 Reporter: Aniket Bhatnagar The documentation for the configuration spark.shuffle.io.preferDirectBufs states that it will force all allocations from Netty to be on-heap, but this currently does not happen. The reason is that preferDirect of Netty's PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. In order to completely stop Netty from using off-heap memory, we need to set the following system properties: - io.netty.noUnsafe=true - io.netty.threadLocalDirectBufferSize=0 The proposal is to set these properties (using System.setProperty) when the executor starts (before any of the Netty classes load) or to document them to hint users on how to completely eliminate Netty's off-heap footprint.
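For illustration, a minimal sketch of the proposed workaround, assuming it runs before the first Netty class is initialized (Netty reads these properties in static initializers); the object name is made up and this is not Spark code:

```scala
object ForceNettyOnHeap {
  // Must execute before any Netty class loads (e.g. at the very top of the
  // executor's main method), or the values are read too late to take effect.
  def apply(): Unit = {
    // Stop Netty from using sun.misc.Unsafe, which backs its direct buffers.
    System.setProperty("io.netty.noUnsafe", "true")
    // Disable Netty's per-thread direct buffer cache.
    System.setProperty("io.netty.threadLocalDirectBufferSize", "0")
  }
}
```

Until a change like this lands, the same properties can be injected as JVM flags instead, e.g. spark.executor.extraJavaOptions=-Dio.netty.noUnsafe=true -Dio.netty.threadLocalDirectBufferSize=0, which by construction is applied before any class loads.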
[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658492#comment-15658492 ] Aniket Bhatnagar edited comment on SPARK-18251 at 11/13/16 5:17 PM: Hi [~jayadevan.m] Which version of scala and spark did you use? I can reproduce this on spark 2.0.1 and scala 2.11.8. I have created a sample project with all the dependencies to easily reproduce this: https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug To reproduce the bug, simply check out the project and run the command sbt run. Thanks, Aniket was (Author: aniket): Hi [~jayadevan.m] Which version of scala spark did you use? I can reproduce this on spark 2.0.1 and scala 2.11.8. I have created a sample project with all the dependencies to easily reproduce this: https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug To reproduce the bug, simple checkout the project and run the command sbt run. Thanks, Aniket > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.0.1 > Environment: OS X > Reporter: Aniket Bhatnagar > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding a non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The bug can be reproduced using the program: > https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e
[jira] [Commented] (SPARK-18421) Dynamic disk allocation
[ https://issues.apache.org/jira/browse/SPARK-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661760#comment-15661760 ] Aniket Bhatnagar commented on SPARK-18421: -- I agree that Spark doesn't manage the storage and, therefore, running an agent and dynamically adding storage to a host is outside the scope. However, what is in scope for Spark is the ability to use added storage without forcing a restart of the executor process. Specifically, spark.local.dirs needs to be a dynamic property. For example, spark.local.dirs could be configured as a glob pattern (something like /mnt*), and whenever a new disk is added & mounted (as /mnt), Spark's shuffle service should be able to use the locally added disk (see the sketch after this message). Additionally, there may be a task to rebalance shuffle blocks once a disk is added so that all local dirs are once again used equally. I don't think detection of a newly mounted directory, rebalancing of blocks, etc. is cloud-specific, as all of this can be done using Java's IO/NIO APIs. This feature would, however, be mostly useful for users running Spark in the cloud. Currently, users are expected to guess their shuffle storage footprint and mount right-sized disks accordingly. If the guess is wrong, the job fails, wasting a lot of time. > Dynamic disk allocation > --- > > Key: SPARK-18421 > URL: https://issues.apache.org/jira/browse/SPARK-18421 > Project: Spark > Issue Type: New Feature > Components: Spark Core > Affects Versions: 2.0.1 > Reporter: Aniket Bhatnagar > Priority: Minor > > The dynamic allocation feature allows you to add executors and scale > computation power. This is great; however, I feel we also need a way to dynamically > scale storage. Currently, if the disk is not able to hold the spilled/shuffle > data, the job is aborted (in YARN, the node manager kills the container), > causing frustration and loss of time. In deployments like AWS EMR, it is > possible to run an agent that adds disks on the fly if it sees that the disks > are running out of space, and it would be great if Spark could immediately > start using the added disks just as it does when new executors are added.
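As a rough sketch of the detection half of that proposal (not Spark code; the class name, the root directory scanned, and the /mnt* glob are all assumptions), Java NIO is enough to spot newly mounted local dirs:

```scala
import java.nio.file.{FileSystems, Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper: rescan for local dirs matching a glob such as
// "/mnt*" and report any that appeared since the last scan.
class LocalDirScanner(globPattern: String = "glob:/mnt*") {
  private val matcher = FileSystems.getDefault.getPathMatcher(globPattern)
  private var known = Set.empty[Path]

  def newlyMounted(): Set[Path] = synchronized {
    val stream = Files.list(Paths.get("/"))
    try {
      val current = stream.iterator().asScala
        .filter(p => Files.isDirectory(p) && matcher.matches(p))
        .toSet
      val added = current -- known
      known = current
      added // candidates to append to the shuffle service's local dirs
    } finally stream.close()
  }
}
```

A shuffle service could call newlyMounted() periodically and append the results to its set of local dirs; rebalancing existing shuffle blocks onto a new disk would still be a separate task, as the comment above notes.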
[jira] [Created] (SPARK-18421) Dynamic disk allocation
Aniket Bhatnagar created SPARK-18421: Summary: Dynamic disk allocation Key: SPARK-18421 URL: https://issues.apache.org/jira/browse/SPARK-18421 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.0.1 Reporter: Aniket Bhatnagar Priority: Minor The dynamic allocation feature allows you to add executors and scale computation power. This is great; however, I feel we also need a way to dynamically scale storage. Currently, if the disk is not able to hold the spilled/shuffle data, the job is aborted (in YARN, the node manager kills the container), causing frustration and loss of time. In deployments like AWS EMR, it is possible to run an agent that adds disks on the fly if it sees that the disks are running out of space, and it would be great if Spark could immediately start using the added disks just as it does when new executors are added.
[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658492#comment-15658492 ] Aniket Bhatnagar commented on SPARK-18251: -- Hi [~jayadevan.m] Which version of scala and spark did you use? I can reproduce this on spark 2.0.1 and scala 2.11.8. I have created a sample project with all the dependencies to easily reproduce this: https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug To reproduce the bug, simply check out the project and run the command sbt run. Thanks, Aniket > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.0.1 > Environment: OS X > Reporter: Aniket Bhatnagar > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding a non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The bug can be reproduced using the program: > https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e
[jira] [Closed] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar closed SPARK-18273. > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.0.1 > Reporter: Aniket Bhatnagar > Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job, as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable this existence check, passed > as a parameter to the DataSource.resolveRelation call.
[jira] [Resolved] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar resolved SPARK-18273. -- Resolution: Not A Problem Glob patterns can be passed instead of full paths to reduce the number of paths passed to the load method. > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.0.1 > Reporter: Aniket Bhatnagar > Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job, as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable this existence check, passed > as a parameter to the DataSource.resolveRelation call.
[jira] [Commented] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637731#comment-15637731 ] Aniket Bhatnagar commented on SPARK-18273: -- Thanks [~srowen]. Didn't realize that I could actually pass a glob pattern. Thank you so much. > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.0.1 > Reporter: Aniket Bhatnagar > Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job, as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable this existence check, passed > as a parameter to the DataSource.resolveRelation call.
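To make the resolution concrete, a short sketch (assuming a SparkSession named spark; the bucket layout shown is made up):

```scala
// Slow to start: DataFrameReader.load existence-checks every path up front.
val parts: Seq[String] = (0 until 10000).map(i => f"s3://bucket/events/part-$i%05d")
val viaSeq = spark.read.format("parquet").load(parts: _*)

// Fast to start: hand Spark one glob and let Hadoop's FileSystem expand it.
val viaGlob = spark.read.format("parquet").load("s3://bucket/events/part-*")
```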
[jira] [Created] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
Aniket Bhatnagar created SPARK-18273: Summary: DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed Key: SPARK-18273 URL: https://issues.apache.org/jira/browse/SPARK-18273 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.1 Reporter: Aniket Bhatnagar If the paths Seq parameter contains a lot of elements, then DataFrameReader.load takes a lot of time starting the job, as it attempts to check whether each of the paths exists using fs.exists. There should be a boolean configuration option to disable this existence check, passed as a parameter to the DataSource.resolveRelation call.
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Description: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: {noformat} Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The bug can be reproduced using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e was: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: ''' Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ''' The bug can be reproduce by using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class >
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Description: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: ''' Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ''' The bug can be reproduced using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e was: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: ``` Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` The bug can be reproduce by using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class >
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Description: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) The bug can be reproduced using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e was: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: ``` Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` The bug can be reproduce by using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class >
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Description: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: ``` Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` The bug can be reproduced using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e was: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) The bug can be reproduce by using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class >
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Description: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: ``` Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` The bug can be reproduced using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e was: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` The bug can be reproduce by using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class >
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Description: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) The bug can be reproduced using the program: https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e was: I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold Empty. If it does so, the following exception is thrown: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) I am attaching a sample program that can be used to reproduce this bug. > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > >
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Summary: DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class (was: DataSet | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class) > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.0.1 > Environment: OS X > Reporter: Aniket Bhatnagar > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding a non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I am attaching a sample program that can be used to reproduce this bug.
[jira] [Updated] (SPARK-18251) DataSet RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Summary: DataSet RuntimeException: Null value appeared in non-nullable field when holding Option Case Class (was: RuntimeException: Null value appeared in non-nullable field when holding Option Case Class) > DataSet RuntimeException: Null value appeared in non-nullable field when > holding Option Case Class > -- > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.0.1 > Environment: OS X > Reporter: Aniket Bhatnagar > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding a non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I am attaching a sample program that can be used to reproduce this bug.
[jira] [Updated] (SPARK-18251) DataSet | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-18251: - Summary: DataSet | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class (was: DataSet RuntimeException: Null value appeared in non-nullable field when holding Option Case Class) > DataSet | RuntimeException: Null value appeared in non-nullable field when > holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: OS X >Reporter: Aniket Bhatnagar > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I am attaching a sample program that can be used to reproduce this bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18251) RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
Aniket Bhatnagar created SPARK-18251: Summary: RuntimeException: Null value appeared in non-nullable field when holding Option Case Class Key: SPARK-18251 URL: https://issues.apache.org/jira/browse/SPARK-18251 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1 Environment: OS X Reporter: Aniket Bhatnagar I am running into a runtime exception when a DataSet holds a None instance for an Option type whose value class has a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot hold None. If it does, the following exception is thrown: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - option value class: "DataSetOptBug.DataRow" - root class: "scala.Option" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) I am attaching a sample program that can be used to reproduce this bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
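A minimal sketch of the kind of sample program referred to above (the object name mirrors the DataSetOptBug class visible in the stack trace; assumes Spark 2.0.x on the classpath):
{code}
import org.apache.spark.sql.SparkSession

object DataSetOptBug {
  case class DataRow(id: Int, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DataSetOptBug")
      .getOrCreate()
    import spark.implicits._

    // Some(DataRow(...)) elements deserialize fine; the None element triggers
    // the RuntimeException because DataRow.id is a non-nullable scala.Int.
    val ds = spark.createDataset(Seq(Option(DataRow(1, "a")), None))
    ds.count() // throws: "Null value appeared in non-nullable field"
    spark.stop()
  }
}
{code}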
[jira] [Created] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
Aniket Bhatnagar created SPARK-8895: --- Summary: MetricsSystem.removeSource not called in StreamingContext.stop Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
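A sketch of the restart pattern described above (illustrative names; assumes Spark 1.4.x, where the StreamingContext constructor registers the metrics source):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "AppName")

val first = new StreamingContext(sc, Seconds(1))
// ... define DStreams, start the job, then shut it down ...
first.stop(stopSparkContext = false, stopGracefully = true)

// Because stop() never calls env.metricsSystem.removeSource, constructing a
// second context registers a source with the same metric names and logs
// "Metrics already registered".
val second = new StreamingContext(sc, Seconds(1))
{code}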
[jira] [Created] (SPARK-8896) StreamingSource should choose a unique name
Aniket Bhatnagar created SPARK-8896: --- Summary: StreamingSource should choose a unique name Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
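The scenario reads like this (a sketch; StreamingSource names itself after the application, so both contexts collide on "AppName.StreamingMetrics.*"):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "AppName")

// Both contexts share one SparkContext but use different batch intervals.
val ssc1 = new StreamingContext(sc, Seconds(1))
val ssc2 = new StreamingContext(sc, Seconds(5)) // logs "Metrics already registered";
                                                // ssc2's metrics go unreported
{code}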
[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8896: Description: If 2 instances of StreamingContext are created and run using the same SparkContext, it results the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] was: If 2 instances of StreamingContext are created and run using the same SparkContext, it results the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. {quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. 
[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
[ https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8895: Description: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: ?? [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] ?? was: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {{ [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] }} MetricsSystem.removeSource not called in StreamingContext.stop -- Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: ?? 
[info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] ?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8896: Description: If 2 instances of StreamingContext are created and run using the same SparkContext, it results the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. {quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} was: If 2 instances of StreamingContext are created and run using the same SparkContext, it results the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. 
{quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
[ https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8895: Description: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {quote} [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} was: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] MetricsSystem.removeSource not called in StreamingContext.stop -- Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. 
Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {quote} [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618764#comment-14618764 ] Aniket Bhatnagar commented on SPARK-8896: - As per the documentation (Scala docs), only one SparkContext per JVM is allowed. However, no such documentation exists for StreamingContext. SparkContext makes an effort to ensure only one SparkContext instance exists in the JVM using the contextBeingConstructed variable. However, no such effort is made in StreamingContext. This led me to believe that multiple StreamingContexts are allowed in a JVM. I can understand why only one SparkContext is allowed per JVM (global state, et al), but I don't think that's true for StreamingContext. There may be genuine use cases for having 2 StreamingContext instances using the same SparkContext to leverage the same workers and have different batch intervals. StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
Aniket Bhatnagar created SPARK-7788: --- Summary: Streaming | Kinesis | KinesisReceiver blocks in onStart Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.3.0 Reporter: Aniket Bhatnagar KinesisReceiver calls worker.run(), which is a blocking call (a while loop) per the source code of the kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in an infinite loop when calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true), perhaps because ReceiverTracker is never able to register the receiver (its receiverInfo field is an empty map), causing it to be stuck in an infinite loop while waiting for the running flag to be set to false. Also, we should investigate a way to have the receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
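A sketch of the kind of fix this implies: run the blocking KCL worker on a background thread so that onStart returns promptly (names are illustrative; worker stands for the configured kinesis-client Worker the receiver builds):
{code}
// Inside a Receiver[T] subclass; `worker` is assumed to be a
// com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.
@volatile private var workerThread: Thread = _

override def onStart(): Unit = {
  workerThread = new Thread(new Runnable {
    override def run(): Unit = worker.run() // the blocking while-loop lives here
  })
  workerThread.setDaemon(true)
  workerThread.setName("kinesis-receiver-worker")
  workerThread.start() // onStart returns immediately, so the receiver can register
}

override def onStop(): Unit = {
  worker.shutdown()    // asks the KCL loop to exit
  workerThread.join()
}
{code}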
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537766#comment-14537766 ] Aniket Bhatnagar commented on SPARK-6154: - Yes. I haven't tested this approach with Scala 2.11, but I believe it should work. It should also be possible to make autocompletion work if we write an adapter between the old jline reader and the new one. Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528089#comment-14528089 ] Aniket Bhatnagar commented on SPARK-6154: - I ran into this as well and my fix was to update org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver to use the latest jline and comment out the line 'CliDriver.getCommandCompletor.foreach((e) => reader.addCompletor(e))'. Yes, you do lose autocompletion, but JDBC (HiveThriftServer2) works fine. What should the proper fix for this be? Perhaps dropping the autocomplete feature? Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
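Roughly the change described in the comment, as a sketch (not the committed fix): jline2's reader lives in a different package, and the Completor wiring is simply dropped.
{code}
// SparkSQLCLIDriver (sketch of the workaround above)
import jline.console.ConsoleReader // jline2; was: import jline.{ConsoleReader, History}

val reader = new ConsoleReader()
reader.setBellEnabled(false)
// Commented out so Scala 2.11 compiles; this is what loses autocompletion:
// CliDriver.getCommandCompletor.foreach((e) => reader.addCompletor(e))
{code}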
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322518#comment-14322518 ] Aniket Bhatnagar commented on SPARK-3638: - Did you build Spark with the kinesis-asl profile? The standard distribution does not include this profile, and therefore you would have to roll your own as described in https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md (mvn -Pkinesis-asl -DskipTests clean package). Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Fix For: 1.1.1, 1.2.0 Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error: {code} Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99) at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29) at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97) at com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103) at com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136) at com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117) at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132) {code} I believe this is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293019#comment-14293019 ] Aniket Bhatnagar commented on SPARK-2243: - I am also interested in having this fixed. Can someone please outline what are specific things that need to be fixed to make this work so that interested people can contribute? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at
[jira] [Created] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core:jackson-core:2.3.1 causes compatibility issues
Aniket Bhatnagar created SPARK-5330: --- Summary: Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core:jackson-core:2.3.1 causes compatibility issues Key: SPARK-5330 URL: https://issues.apache.org/jira/browse/SPARK-5330 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aniket Bhatnagar Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. Users of jackson-module-scala had to depend on the same version to avoid any class compatibility issues. However, for Scala 2.11, jackson-module-scala is no longer published at version 2.3.1. Since version 2.3.1 is quite old, perhaps we should investigate upgrading to the latest jackson-core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core:jackson-core:2.3.1 causes compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283432#comment-14283432 ] Aniket Bhatnagar commented on SPARK-5330: - One possible workaround is to define the dependency as follows: "com.fasterxml.jackson.module" % "jackson-module-scala_2.10" % "2.3.1" excludeAll( ExclusionRule(organization = "org.scala-lang"), ExclusionRule(organization = "org.scalatest") ) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core:jackson-core:2.3.1 causes compatibility issues --- Key: SPARK-5330 URL: https://issues.apache.org/jira/browse/SPARK-5330 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aniket Bhatnagar Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. Users of jackson-module-scala had to depend on the same version to avoid any class compatibility issues. However, for Scala 2.11, jackson-module-scala is no longer published at version 2.3.1. Since version 2.3.1 is quite old, perhaps we should investigate upgrading to the latest jackson-core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
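The same workaround laid out as a multi-line build.sbt snippet (the archive had stripped the string quotes):
{code}
libraryDependencies += ("com.fasterxml.jackson.module" % "jackson-module-scala_2.10" % "2.3.1")
  .excludeAll(
    ExclusionRule(organization = "org.scala-lang"),
    ExclusionRule(organization = "org.scalatest")
  )
{code}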
[jira] [Resolved] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
[ https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar resolved SPARK-5164. - Resolution: Duplicate Duplicates and has similar findings to SPARK-1825. YARN | Spark job submits from windows machine to a linux YARN cluster fail -- Key: SPARK-5164 URL: https://issues.apache.org/jira/browse/SPARK-5164 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Spark submit from Windows 7 YARN cluster on CentOS 6.5 Reporter: Aniket Bhatnagar While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of File.pathSeparator in YarnSparkHadoopUtil. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
Aniket Bhatnagar created SPARK-5164: --- Summary: YARN | Spark job submits from windows machine to a linux YARN cluster fail Key: SPARK-5164 URL: https://issues.apache.org/jira/browse/SPARK-5164 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Spark submit from Windows 7 YARN cluster on CentOS 6.5 Reporter: Aniket Bhatnagar While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of Path.SEPARATOR in ClientBase and YarnSparkHadoopUtil. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
[ https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-5164: Description: While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of File.pathSeparator in ClientBase and YarnSparkHadoopUtil. was: While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of Path.SEPARATOR in ClientBase and YarnSparkHadoopUtil. YARN | Spark job submits from windows machine to a linux YARN cluster fail -- Key: SPARK-5164 URL: https://issues.apache.org/jira/browse/SPARK-5164 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Spark submit from Windows 7 YARN cluster on CentOS 6.5 Reporter: Aniket Bhatnagar While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of File.pathSeparator in ClientBase and YarnSparkHadoopUtil. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
[ https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-5164: Description: While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of File.pathSeparator in YarnSparkHadoopUtil. was: While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of File.pathSeparator in ClientBase and YarnSparkHadoopUtil. YARN | Spark job submits from windows machine to a linux YARN cluster fail -- Key: SPARK-5164 URL: https://issues.apache.org/jira/browse/SPARK-5164 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Spark submit from Windows 7 YARN cluster on CentOS 6.5 Reporter: Aniket Bhatnagar While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of File.pathSeparator in YarnSparkHadoopUtil. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail
[ https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270533#comment-14270533 ] Aniket Bhatnagar commented on SPARK-5164: - The first issue can be fixed by using Environment.variable.$$() instead of Environment.variable.$() in ClientBase. But unfortunately, the $$() method seems to have been added only in recent versions of Hadoop, making it not a viable option if we want to support many versions of Hadoop. I am not sure if it is possible to detect the remote OS using the YARN API. I am thinking that perhaps we should introduce a new configuration - spark.yarn.remote.os - that hints at the target YARN OS and can take the values Windows or Linux. We can then use this configuration in ClientBase and when choosing the path separator. I am happy to submit a pull request for this, once the recommendation is vetted by the community. YARN | Spark job submits from windows machine to a linux YARN cluster fail -- Key: SPARK-5164 URL: https://issues.apache.org/jira/browse/SPARK-5164 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Spark submit from Windows 7 YARN cluster on CentOS 6.5 Reporter: Aniket Bhatnagar While submitting spark jobs from a windows machine to a linux YARN cluster, the jobs fail because of the following reasons: 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of linux's syntax ($JAVA_HOME, $PWD, etc). 2. Paths in launch environment are delimited by semi-colon instead of colon. This is because of usage of Path.SEPARATOR in ClientBase and YarnSparkHadoopUtil. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
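A sketch of what the proposal might look like (the config name and helper are hypothetical, taken only from the comment above; assumes a SparkConf is in scope): expand environment variables using the cluster's OS syntax rather than the submitting client's.
{code}
// Hypothetical helper following the comment's proposal; not committed Spark API.
def expandForRemoteOs(varName: String, remoteOs: String): String =
  if (remoteOs.equalsIgnoreCase("Windows")) s"%$varName%" // Windows syntax
  else s"$$$varName"                                      // Linux syntax

val remoteOs  = sparkConf.get("spark.yarn.remote.os", "Linux") // proposed config
val javaHome  = expandForRemoteOs("JAVA_HOME", remoteOs)
val pathSep   = if (remoteOs.equalsIgnoreCase("Windows")) ";" else ":"
{code}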
[jira] [Created] (SPARK-5143) spark-network-yarn 2.11 depends on spark-network-shuffle 2.10
Aniket Bhatnagar created SPARK-5143: --- Summary: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10 Key: SPARK-5143 URL: https://issues.apache.org/jira/browse/SPARK-5143 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Aniket Bhatnagar Priority: Critical It seems that spark-network-yarn compiled for scala 2.11 depends on spark-network-shuffle compiled for scala 2.10. This causes builds with SBT 0.13.7 to fail with the error Conflicting cross-version suffixes. Screenshot of dependency: http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
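A possible client-side workaround (an assumption on our part, not something from the issue): override the transitive artifact so sbt's cross-version conflict check sees a consistent suffix.
{code}
// build.sbt - force the Scala 2.11 artifact in place of the mispublished 2.10 one.
dependencyOverrides += "org.apache.spark" % "spark-network-shuffle_2.11" % "1.2.0"
{code}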
[jira] [Created] (SPARK-5144) spark-yarn module should be published
Aniket Bhatnagar created SPARK-5144: --- Summary: spark-yarn module should be published Key: SPARK-5144 URL: https://issues.apache.org/jira/browse/SPARK-5144 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Aniket Bhatnagar We disabled publishing of certain modules in SPARK-3452. One of such modules is spark-yarn. This breaks applications that submit spark jobs programatically with master set as yarn-client. This is because SparkContext is dependent on classes from yarn-client module to submit the YARN application. Here is the stack trace that you get if you submit the spark job without yarn-client dependency: 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - MemoryStore started with capacity 731.7 MB Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) at org.apache.spark.SparkContext.init(SparkContext.scala:232) at com.myimpl.Server:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) ... 27 more Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:190) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195) ... 29 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
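For context, this is the dependency such applications would need to declare were the module published (a sketch; as of 1.2.0 this coordinate does not resolve, which is exactly what the issue asks to change):
{code}
// build.sbt - required at runtime to construct a SparkContext with
// master = "yarn-client" from application code.
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "1.2.0"
{code}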
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269145#comment-14269145 ] Aniket Bhatnagar commented on SPARK-3452: - I have opened another defect - SPARK-5144 suggesting yarn-client to be published. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267469#comment-14267469 ] Aniket Bhatnagar commented on SPARK-3452: - Here is the exception I am getting while triggering a job that contains SparkContext having master as yarn-client. A quick look at 1.2.0 source code suggests I should depend on spark-yarn module which I can't as it is not longer published. Do you want me to log a separate defect for this and submit appropriate pull request? 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - MemoryS tore started with capacity 731.7 MB Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) at org.apache.spark.SparkContext.init(SparkContext.scala:232) at com.myimpl.Server:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) ... 27 more Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:190) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195) ... 29 more Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266156#comment-14266156 ] Aniket Bhatnagar commented on SPARK-3452: - Ok.. I'll test this out by adding a dependency on spark-network-yarn and see how it goes. Fingers crossed! Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
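For readers following this thread, a runtime dependency of the sort discussed might look like the sbt sketch below. The artifact name (spark-network-yarn) is taken from the comment above; the version and whether the module is published vary by Spark release, so treat the coordinates as assumptions:
{code}
// build.sbt -- minimal sketch; coordinates and versions are assumptions
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  // Needed on the runtime classpath when submitting in yarn-client mode:
  "org.apache.spark" %% "spark-network-yarn" % "1.2.0" % "runtime"
)
{code}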
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265749#comment-14265749 ] Aniket Bhatnagar commented on SPARK-3452: - I would like this to be revisited. The issue I am facing is that while people may not depend on some modules at compile time, they may depend on them at runtime. For example, I am building a Spark server that lets users submit Spark jobs using convenient REST endpoints. This used to work great, even in yarn-client mode. However, once I migrate to 1.2.0, this breaks because I can no longer add a dependency from my Spark server on the spark-yarn module, which is used when submitting jobs to a YARN cluster. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4862) Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error
Aniket Bhatnagar created SPARK-4862: --- Summary: Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error Key: SPARK-4862 URL: https://issues.apache.org/jira/browse/SPARK-4862 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor If the checkpoint directory is set to a local filesystem path, it results in weird error messages like the following:
{code}
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[467] at apply at List.scala:318(0) has different number of partitions than original RDD MapPartitionsRDD[461] at mapPartitions at StateDStream.scala:71(56)
{code}
It would be great if Spark could output an error message that better hints at what has gone wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
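As a minimal sketch of the distinction the error hints at: the streaming checkpoint directory should point at storage visible to every executor (for example HDFS), not a node-local path. The paths and batch interval below are illustrative assumptions:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Cluster-visible storage (e.g. HDFS), not a local path like /tmp/checkpoints:
ssc.checkpoint("hdfs://namenode:8020/checkpoints/my-app")
{code}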
[jira] [Closed] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar closed SPARK-3640. --- Resolution: Not a Problem Tested, and Chris's suggestion of using an EC2 IAM instance profile works fine. KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider - Key: SPARK-3640 URL: https://issues.apache.org/jira/browse/SPARK-3640 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: kinesis KinesisUtils should accept AWS Credentials as a parameter and should default to DefaultCredentialsProvider if no credentials are provided. Currently, the implementation forces usage of DefaultCredentialsProvider which can be a pain especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1350) YARN ContainerLaunchContext should use cluster's JAVA_HOME
[ https://issues.apache.org/jira/browse/SPARK-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239060#comment-14239060 ] Aniket Bhatnagar commented on SPARK-1350: - I am using Hadoop 2.5.0 (CDH). Agreed that the code handles Windows. But the use case I am talking about is when a SparkContext is created programmatically on a Windows machine and is used to submit jobs to a YARN cluster running on Linux. Per the code above, %JAVA_HOME%/bin/java will be generated as one of the commands by ClientBase and submitted to the YARN cluster. This will obviously fail when YARN tries to launch the container, as % is treated differently on Linux. YARN ContainerLaunchContext should use cluster's JAVA_HOME -- Key: SPARK-1350 URL: https://issues.apache.org/jira/browse/SPARK-1350 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 0.9.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.0.0
{code}
var javaCommand = "java"
val javaHome = System.getenv("JAVA_HOME")
if ((javaHome != null && !javaHome.isEmpty()) || env.isDefinedAt("JAVA_HOME")) {
  javaCommand = Environment.JAVA_HOME.$() + "/bin/java"
}
{code}
Currently, if JAVA_HOME is specified on the client, it will be used instead of the value given on the cluster. This makes it so that Java must be installed in the same place on the client as on the cluster. This is a possibly incompatible change that we should get in before 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1350) YARN ContainerLaunchContext should use cluster's JAVA_HOME
[ https://issues.apache.org/jira/browse/SPARK-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237525#comment-14237525 ] Aniket Bhatnagar commented on SPARK-1350: - [~sandyr] Using Environment.JAVA_HOME.$() causes issues when submitting Spark applications from a Windows box to a YARN cluster running on Linux (with the Spark master set as yarn-client). This is because Environment.JAVA_HOME.$() resolves to %JAVA_HOME%, which is not a valid executable path on Linux. Is this a Spark issue or a YARN issue? YARN ContainerLaunchContext should use cluster's JAVA_HOME -- Key: SPARK-1350 URL: https://issues.apache.org/jira/browse/SPARK-1350 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 0.9.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.0.0
{code}
var javaCommand = "java"
val javaHome = System.getenv("JAVA_HOME")
if ((javaHome != null && !javaHome.isEmpty()) || env.isDefinedAt("JAVA_HOME")) {
  javaCommand = Environment.JAVA_HOME.$() + "/bin/java"
}
{code}
Currently, if JAVA_HOME is specified on the client, it will be used instead of the value given on the cluster. This makes it so that Java must be installed in the same place on the client as on the cluster. This is a possibly incompatible change that we should get in before 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
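To make the mismatch concrete, here is an illustrative Scala sketch (not Spark or YARN source): the variable-reference syntax the client emits has to match the shell of the cluster OS, not the client OS. The helper below is hypothetical:
{code}
// Hypothetical helper: pick the expansion syntax of the *cluster* OS.
def javaHomeRef(clusterRunsWindows: Boolean): String =
  if (clusterRunsWindows) "%JAVA_HOME%" else "$JAVA_HOME"

// A Windows client submitting to a Linux cluster must still emit $JAVA_HOME;
// emitting %JAVA_HOME% produces a command no Linux shell will expand.
val launchCommand = javaHomeRef(clusterRunsWindows = false) + "/bin/java"
{code}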
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232719#comment-14232719 ] Aniket Bhatnagar commented on SPARK-3638: - Yes. You may want to open another JIRA ticket for offering pre-built Kinesis packages on the Spark download page. Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Fix For: 1.1.1, 1.2.0 Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
    at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
    at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
    at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
    at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231086#comment-14231086 ] Aniket Bhatnagar commented on SPARK-3638: - [~ashrafuzzaman] Did you build using the kinesis-asl profile as described in https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md (mvn -Pkinesis-asl -DskipTests clean package)? None of the pre-built Spark distributions have the profile enabled. I just pulled down the 1.1.1 source code, did a build as described in the Kinesis integration documentation, and I can see the class in the built assembly. Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Fix For: 1.1.1, 1.2.0 Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
    at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
    at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
    at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
    at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218063#comment-14218063 ] Aniket Bhatnagar commented on SPARK-2321: - Just a quick question: will the API be usable from the job submitter process in yarn-cluster mode (i.e., when the driver is running as a separate YARN process)? Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical Fix For: 1.2.0 This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
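For context, a hedged sketch of the existing listener hooks under discussion (marked DeveloperApi as of Spark 1.x; the exact method set varies by version). Registered listeners run inside the driver process, which is why the yarn-cluster question above matters: in that mode the submitter process is not the driver.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Logs each completed stage; callbacks fire in the driver process.
class StageLogger extends SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    println(s"Stage ${event.stageInfo.stageId} completed: ${event.stageInfo.name}")
}

def register(sc: SparkContext): Unit = sc.addSparkListener(new StageLogger)
{code}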
[jira] [Created] (SPARK-4473) [Core] StageInfo should have ActiveJob's group ID as a field
Aniket Bhatnagar created SPARK-4473: --- Summary: [Core] StageInfo should have ActiveJob's group ID as a field Key: SPARK-4473 URL: https://issues.apache.org/jira/browse/SPARK-4473 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aniket Bhatnagar Priority: Minor It would be convenient to have the active job's group ID in StageInfo so that JobProgressListener can be used to track a specific job or jobs in a group. Perhaps the stage's group ID could also be shown in Spark's default UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
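For reference, a small sketch of the job-group mechanism this ticket builds on: a group ID is attached via SparkContext.setJobGroup, and the request here is to surface that ID in StageInfo. The group name below is illustrative:
{code}
import org.apache.spark.SparkContext

def runGrouped(sc: SparkContext): Long = {
  // All jobs started on this thread after this call carry the group ID.
  sc.setJobGroup("nightly-etl", "Nightly ETL jobs", interruptOnCancel = true)
  sc.parallelize(1 to 1000).count() // tracked under group "nightly-etl"
}
{code}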
[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196534#comment-14196534 ] Aniket Bhatnagar commented on SPARK-3640: - Thanks Chris for looking into this. The documentation would certainly be useful. However, the problem I am facing with DefaultCredentialsProvider is that each node in the cluster needs to be set up with those credentials in the user's home directory, and this is a bit tedious. I would like the driver to pass credentials to all nodes in the cluster to avoid such operational overhead. I have submitted a pull request that contains the changes I had to make to allow the driver to pass user-defined credentials. Do have a look and let me know if there is a better way. https://github.com/apache/spark/pull/3092 KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider - Key: SPARK-3640 URL: https://issues.apache.org/jira/browse/SPARK-3640 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: kinesis KinesisUtils should accept AWS Credentials as a parameter and should default to DefaultCredentialsProvider if no credentials are provided. Currently, the implementation forces usage of DefaultCredentialsProvider which can be a pain especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
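The shape of the change in that pull request is roughly the sketch below; the class and field names are illustrative assumptions, not Spark's actual API:
{code}
import com.amazonaws.auth.{AWSCredentials, BasicAWSCredentials}

// Explicit credentials the driver can ship to workers, instead of requiring
// every node to resolve credentials via DefaultCredentialsProvider locally.
case class SerializableAWSCredentials(accessKeyId: String, secretKey: String)
    extends Serializable {
  def toAWSCredentials: AWSCredentials = new BasicAWSCredentials(accessKeyId, secretKey)
}
{code}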
[jira] [Created] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
Aniket Bhatnagar created SPARK-3638: --- Summary: Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
    at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
    at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
    at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
    at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
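One common workaround for this class of conflict, sketched for an sbt build: pin a single httpclient version across the dependency graph so the constructor the AWS SDK expects actually exists. The pinned version below is an assumption; use whichever version the AWS SDK in your build requires:
{code}
// build.sbt -- force one commons-httpclient version for the whole graph
dependencyOverrides += "org.apache.httpcomponents" % "httpclient" % "4.2.5"
{code}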
[jira] [Created] (SPARK-3639) Kinesis examples set master as local
Aniket Bhatnagar created SPARK-3639: --- Summary: Kinesis examples set master as local Key: SPARK-3639 URL: https://issues.apache.org/jira/browse/SPARK-3639 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0, 1.0.2 Reporter: Aniket Bhatnagar Priority: Minor Kinesis examples set the master as local, thus not allowing the examples to be tested on a cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
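The fix this ticket implies is small: examples can leave the master unset and let spark-submit provide it, falling back to local only when nothing was supplied. A minimal sketch, assuming the standard spark.master property:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("KinesisWordCount")
// Respect a master supplied by spark-submit; default to local[*] otherwise.
if (!conf.contains("spark.master")) {
  conf.setMaster("local[*]")
}
{code}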
[jira] [Updated] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-3638: Component/s: Streaming Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
    at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
    at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
    at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
    at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3639) Kinesis examples set master as local
[ https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143108#comment-14143108 ] Aniket Bhatnagar commented on SPARK-3639: - If the community agrees this is an issue, I can submit a PR for this. Kinesis examples set master as local Key: SPARK-3639 URL: https://issues.apache.org/jira/browse/SPARK-3639 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor Labels: examples Kinesis examples set the master as local, thus not allowing the examples to be tested on a cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
Aniket Bhatnagar created SPARK-3640: --- Summary: KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider Key: SPARK-3640 URL: https://issues.apache.org/jira/browse/SPARK-3640 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar KinesisUtils should accept AWS Credentials as a parameter and should default to DefaultCredentialsProvider if no credentials are provided. Currently, the implementation forces usage of DefaultCredentialsProvider which can be a pain especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143111#comment-14143111 ] Aniket Bhatnagar commented on SPARK-3640: - I understand that the credentials need to be serializable so that they can be transmitted to workers. I can put in a pull request to make this possible. KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider - Key: SPARK-3640 URL: https://issues.apache.org/jira/browse/SPARK-3640 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: kinesis KinesisUtils should accept AWS Credentials as a parameter and should default to DefaultCredentialsProvider if no credentials are provided. Currently, the implementation forces usage of DefaultCredentialsProvider which can be a pain especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org