[jira] [Updated] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty

2016-12-09 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18787:
-
Description: 
The documentation for the configuration spark.shuffle.io.preferDirectBufs 
states that it will force all allocations from Netty to be on-heap, but this 
currently does not happen. The reason is that the preferDirect parameter of Netty's 
PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. 
In order to completely stop Netty from using off-heap memory, we need to set 
the following system properties:
- io.netty.noUnsafe=true
- io.netty.threadLocalDirectBufferSize=0

The proposal is to set these properties (using System.setProperties) when the 
executor starts (before any of the Netty classes load), or to document these 
properties so users know how to completely eliminate Netty's off-heap 
footprint.
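
As a rough sketch of the first option (the object and method names below are 
made up for illustration and are not part of any actual patch), the properties 
must be set before any io.netty class is initialized, since Netty reads them in 
static initializers; the documentation-only alternative amounts to pointing 
users at the equivalent -D JVM flags via spark.driver.extraJavaOptions and 
spark.executor.extraJavaOptions.

{noformat}
// Hypothetical sketch only; names are illustrative, not the actual Spark change.
object NettyOffHeapKnobs {
  // Must run before any io.netty.* class is initialized, because Netty reads
  // these properties in static initializers.
  def disableNettyDirectBuffers(): Unit = {
    System.setProperty("io.netty.noUnsafe", "true")
    System.setProperty("io.netty.threadLocalDirectBufferSize", "0")
  }
}

// Equivalent without code changes (the "document these properties" option),
// e.g. in spark-defaults.conf:
//   spark.driver.extraJavaOptions   -Dio.netty.noUnsafe=true -Dio.netty.threadLocalDirectBufferSize=0
//   spark.executor.extraJavaOptions -Dio.netty.noUnsafe=true -Dio.netty.threadLocalDirectBufferSize=0
{noformat}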

  was:
The documentation for the configuration spark.shuffle.io.preferDirectBufs 
states that it will force all allocations from Netty to be on-heap but this 
currently does not happen. The reason is that preferDirect of Netty's 
PooledByteBufAllocator doesn't completely eliminate use of off heap by Netty. 
In order to completely stop netty from using off heap memory, we need to set 
the following system properties:
- io.netty.noUnsafe=true
- io.netty.threadLocalDirectBufferSize=0

The proposal is to set properties (using System.setProperties) when the 
executor starts (before any of the Netty classes load) or document these 
properties to hint users on how to completely eliminate Netty' off heap 
footprint.


> spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer 
> usage by Netty
> ---
>
> Key: SPARK-18787
> URL: https://issues.apache.org/jira/browse/SPARK-18787
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Aniket Bhatnagar
>
> The documentation for the configuration spark.shuffle.io.preferDirectBufs 
> states that it will force all allocations from Netty to be on-heap, but this 
> currently does not happen. The reason is that the preferDirect parameter of 
> Netty's PooledByteBufAllocator doesn't completely eliminate Netty's use of 
> off-heap memory. In order to completely stop Netty from using off-heap memory, we 
> need to set the following system properties:
> - io.netty.noUnsafe=true
> - io.netty.threadLocalDirectBufferSize=0
> The proposal is to set these properties (using System.setProperties) when the 
> executor starts (before any of the Netty classes load), or to document these 
> properties so users know how to completely eliminate Netty's off-heap 
> footprint.






[jira] [Updated] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty

2016-12-09 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18787:
-
Description: 
The documentation for the configuration spark.shuffle.io.preferDirectBufs 
states that it will force all allocations from Netty to be on-heap, but this 
currently does not happen. The reason is that preferDirect of Netty's 
PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. 
In order to completely stop Netty from using off-heap memory, we need to set 
the following system properties:
- io.netty.noUnsafe=true
- io.netty.threadLocalDirectBufferSize=0

The proposal is to set these properties (using System.setProperties) when the 
executor starts (before any of the Netty classes load), or to document these 
properties so users know how to completely eliminate Netty's off-heap 
footprint.

  was:
The documentation for the configuration spark.shuffle.io.preferDirectBufs 
states that it will force all allocations from Netty to be on-heap but this 
currently does not happen. The reason is that preferDirect of Netty's 
PooledByteBufAllocator doesn't completely eliminate use of off heap by Netty. 
In order to completely stop netty from using off heap memory, we need to set 
the following system properties:
- io.netty.noUnsafe=true
- io.netty.threadLocalDirectBufferSize=0

The proposal is to set properties (using System.setProperties) when the 
executor starts (before any of the Netty classes load) or document these 
properties to hint users on how to completely eliminate off Netty' heap 
footprint.


> spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer 
> usage by Netty
> ---
>
> Key: SPARK-18787
> URL: https://issues.apache.org/jira/browse/SPARK-18787
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Aniket Bhatnagar
>
> The documentation for the configuration spark.shuffle.io.preferDirectBufs 
> states that it will force all allocations from Netty to be on-heap, but this 
> currently does not happen. The reason is that preferDirect of Netty's 
> PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap 
> memory. In order to completely stop Netty from using off-heap memory, we 
> need to set the following system properties:
> - io.netty.noUnsafe=true
> - io.netty.threadLocalDirectBufferSize=0
> The proposal is to set these properties (using System.setProperties) when the 
> executor starts (before any of the Netty classes load), or to document these 
> properties so users know how to completely eliminate Netty's off-heap 
> footprint.






[jira] [Created] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty

2016-12-08 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-18787:


 Summary: spark.shuffle.io.preferDirectBufs does not completely 
turn off direct buffer usage by Netty
 Key: SPARK-18787
 URL: https://issues.apache.org/jira/browse/SPARK-18787
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.2
Reporter: Aniket Bhatnagar


The documentation for the configuration spark.shuffle.io.preferDirectBufs 
states that it will force all allocations from Netty to be on-heap, but this 
currently does not happen. The reason is that preferDirect of Netty's 
PooledByteBufAllocator doesn't completely eliminate Netty's use of off-heap memory. 
In order to completely stop Netty from using off-heap memory, we need to set 
the following system properties:
- io.netty.noUnsafe=true
- io.netty.threadLocalDirectBufferSize=0

The proposal is to set these properties (using System.setProperties) when the 
executor starts (before any of the Netty classes load), or to document these 
properties so users know how to completely eliminate Netty's off-heap 
footprint.






[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-13 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658492#comment-15658492
 ] 

Aniket Bhatnagar edited comment on SPARK-18251 at 11/13/16 5:17 PM:


Hi [~jayadevan.m]

Which versions of Scala and Spark did you use? I can reproduce this on Spark 
2.0.1 and Scala 2.11.8. I have created a sample project with all the 
dependencies to easily reproduce this:
https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug
To reproduce the bug, simply check out the project and run the command sbt run.

Thanks,
Aniket


was (Author: aniket):
Hi [~jayadevan.m]

Which version of scala spark did you use? I can reproduce this on spark 2.0.1 
and scala 2.11.8. I have created a sample project with all the dependencies to 
easily reproduce this:
https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug
To reproduce the bug, simple checkout the project and run the command sbt run.

Thanks,
Aniket

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e
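
For reference, a minimal sketch of the reproduction, assuming the layout 
implied by the stack trace (an enclosing object named DataSetOptBug); the 
linked gist and sbt project above are the authoritative reproducers.

{noformat}
// Minimal sketch; names follow the stack trace, see the gist for the real code.
import org.apache.spark.sql.SparkSession

object DataSetOptBug {
  case class DataRow(id: Int, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-18251").getOrCreate()
    import spark.implicits._

    val rows: Seq[Option[DataRow]] = Seq(Some(DataRow(1, "a")), None)
    // Encoding runs inside the tasks; the None element trips the non-nullable
    // check on DataRow.id (scala.Int) and the job aborts with
    // "Null value appeared in non-nullable field".
    val ds = spark.createDataset(spark.sparkContext.parallelize(rows))
    println(ds.count())
  }
}
{noformat}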






[jira] [Commented] (SPARK-18421) Dynamic disk allocation

2016-11-13 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661760#comment-15661760
 ] 

Aniket Bhatnagar commented on SPARK-18421:
--

I agree that Spark doesn't manage the storage, and therefore running an 
agent and dynamically adding storage to a host are outside the scope. However, 
what is in scope for Spark is the ability to use newly added storage without 
forcing a restart of the executor process. Specifically, spark.local.dirs needs to be 
a dynamic property. For example, spark.local.dirs could be configured as a glob 
pattern (something like /mnt*), and whenever a new disk is added and mounted (as 
/mnt), Spark's shuffle service should be able to use the locally 
added disk. Additionally, there may be a task to rebalance shuffle blocks once a 
disk is added so that all local dirs are once again used equally. 

I don't think detection of a newly mounted directory, rebalancing of blocks, etc. 
is cloud specific, as all of this can be done using Java's IO/NIO APIs.
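
For illustration only (LocalDirsGlob and its method are made-up names, not 
existing Spark code): expanding a glob-style spark.local.dirs value such as 
/mnt* needs nothing beyond java.nio, which is why the detection itself is not 
cloud specific.

{noformat}
// Illustrative sketch of glob expansion with plain NIO; not Spark code.
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object LocalDirsGlob {
  // Expand a glob-style local-dirs entry such as "/mnt*" into the directories
  // that currently exist and match it.
  def expand(pattern: String): Seq[String] = {
    val p = Paths.get(pattern)
    val parent = Option(p.getParent).getOrElse(Paths.get("/"))
    val stream = Files.newDirectoryStream(parent, p.getFileName.toString)
    try stream.asScala.filter(Files.isDirectory(_)).map(_.toString).toVector
    finally stream.close()
  }

  def main(args: Array[String]): Unit = {
    // Re-evaluating this periodically (or via a WatchService) would let a newly
    // mounted disk show up without restarting the executor process.
    println(expand("/mnt*"))
  }
}
{noformat}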

This feature would, however, be mostly useful for users running Spark in the 
cloud. Currently, users are expected to guess their shuffle storage 
footprint and mount appropriately sized disks up front. If the guess is wrong, 
the job fails, wasting a lot of time.

> Dynamic disk allocation
> ---
>
> Key: SPARK-18421
> URL: https://issues.apache.org/jira/browse/SPARK-18421
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> The dynamic allocation feature allows you to add executors and scale computation 
> power. This is great; however, I feel we also need a way to dynamically 
> scale storage. Currently, if the disk is not able to hold the spilled/shuffle 
> data, the job is aborted (in YARN, the node manager kills the container), 
> causing frustration and loss of time. In deployments like AWS EMR, it is 
> possible to run an agent that adds disks on the fly if it sees that the disks 
> are running out of space, and it would be great if Spark could immediately 
> start using the added disks just as it does when new executors are added.






[jira] [Created] (SPARK-18421) Dynamic disk allocation

2016-11-12 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-18421:


 Summary: Dynamic disk allocation
 Key: SPARK-18421
 URL: https://issues.apache.org/jira/browse/SPARK-18421
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Aniket Bhatnagar
Priority: Minor


The dynamic allocation feature allows you to add executors and scale computation 
power. This is great; however, I feel we also need a way to dynamically 
scale storage. Currently, if the disk is not able to hold the spilled/shuffle 
data, the job is aborted (in YARN, the node manager kills the container), 
causing frustration and loss of time. In deployments like AWS EMR, it is 
possible to run an agent that adds disks on the fly if it sees that the disks 
are running out of space, and it would be great if Spark could immediately start 
using the added disks just as it does when new executors are added.







[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-11 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658492#comment-15658492
 ] 

Aniket Bhatnagar commented on SPARK-18251:
--

Hi [~jayadevan.m]

Which versions of Scala and Spark did you use? I can reproduce this on Spark 2.0.1 
and Scala 2.11.8. I have created a sample project with all the dependencies to 
easily reproduce this:
https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug
To reproduce the bug, simply check out the project and run the command sbt run.

Thanks,
Aniket

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e






[jira] [Closed] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are pass

2016-11-04 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar closed SPARK-18273.


> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are pass 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job, as it attempts to 
> check whether each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the check for the paths' existence, and it 
> should be passed as a parameter to the DataSource.resolveRelation call.






[jira] [Resolved] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are pass

2016-11-04 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar resolved SPARK-18273.
--
Resolution: Not A Problem

Glob patterns can be passed instead of full paths to reduce the number of 
paths passed in to the load method.
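
To make the resolution concrete, an illustrative sketch (the bucket and layout 
are hypothetical): a single glob handed to DataFrameReader.load is expanded by 
the underlying Hadoop FileSystem, instead of the caller enumerating every path.

{noformat}
// Illustrative sketch; paths are hypothetical.
import org.apache.spark.sql.SparkSession

object GlobLoadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("glob-load").getOrCreate()

    // Slow when the Seq is large: every path gets its own existence check.
    // val paths = (0 until 10000).map(i => s"s3://bucket/events/dt=2016-11-04/part-$i.parquet")
    // val df = spark.read.format("parquet").load(paths: _*)

    // Same intent with one glob pattern, resolved by the file system.
    val df = spark.read.format("parquet")
      .load("s3://bucket/events/dt=2016-11-04/part-*.parquet")
    println(df.count())
  }
}
{noformat}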

> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are pass 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job, as it attempts to 
> check whether each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the check for the paths' existence, and it 
> should be passed as a parameter to the DataSource.resolveRelation call.






[jira] [Commented] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are pass

2016-11-04 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637731#comment-15637731
 ] 

Aniket Bhatnagar commented on SPARK-18273:
--

Thanks [~srowen]. I didn't realize that I could actually pass a glob pattern. Thank 
you so much.

> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are pass 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job, as it attempts to 
> check whether each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the check for the paths' existence, and it 
> should be passed as a parameter to the DataSource.resolveRelation call.






[jira] [Created] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are pass

2016-11-04 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-18273:


 Summary: DataFrameReader.load takes a lot of time to start the job 
if a lot of file/dir paths are pass 
 Key: SPARK-18273
 URL: https://issues.apache.org/jira/browse/SPARK-18273
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Aniket Bhatnagar


If the paths Seq parameter contains a lot of elements, then 
DataFrameReader.load takes a lot of time starting the job, as it attempts to 
check whether each of the paths exists using fs.exists. There should be a boolean 
configuration option to disable the check for the paths' existence, and it should 
be passed as a parameter to the DataSource.resolveRelation call.






[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Description: 
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding a non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

{noformat}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The bug can be reproduced by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e

  was:
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

'''
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
'''

The bug can be reproduce by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 

[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Description: 
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding a non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

'''
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
'''

The bug can be reproduced by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e

  was:
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

The bug can be reproduce by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 

[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Description: 
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding a non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


The bug can be reproduced by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e

  was:
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

The bug can be reproduce by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 

[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Description: 
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding a non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

The bug can be reproduced by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e

  was:
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


The bug can be reproduce by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 

[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Description: 
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding a non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

The bug can be reproduced by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e

  was:
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


The bug can be reproduce by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 

[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Description: 
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding a non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


The bug can be reproduced by using the program: 
https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e

  was:
I am running into a runtime exception when a DataSet is holding an Empty object 
instance for an Option type that is holding non-nullable field. For instance, 
if we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold Empty. If it does so, the following exception is thrown:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


I am attaching a sample program that can be used to reproduce this bug.



> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
>

[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Summary: DataSet API | RuntimeException: Null value appeared in 
non-nullable field when holding Option Case Class  (was: DataSet | 
RuntimeException: Null value appeared in non-nullable field when holding Option 
Case Class)

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I am attaching a sample program that can be used to reproduce this bug.






[jira] [Updated] (SPARK-18251) DataSet RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Summary: DataSet RuntimeException: Null value appeared in non-nullable 
field when holding Option Case Class  (was: RuntimeException: Null value 
appeared in non-nullable field when holding Option Case Class)

> DataSet RuntimeException: Null value appeared in non-nullable field when 
> holding Option Case Class
> --
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an empty 
> (None) instance for an Option type that holds a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold None. If it does so, the following exception is thrown:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I am attaching a sample program that can be used to reproduce this bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18251) DataSet | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-18251:
-
Summary: DataSet | RuntimeException: Null value appeared in non-nullable 
field when holding Option Case Class  (was: DataSet RuntimeException: Null 
value appeared in non-nullable field when holding Option Case Class)

> DataSet | RuntimeException: Null value appeared in non-nullable field when 
> holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an empty 
> (None) instance for an Option type that holds a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold None. If it does so, the following exception is thrown:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I am attaching a sample program that can be used to reproduce this bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18251) RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-03 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-18251:


 Summary: RuntimeException: Null value appeared in non-nullable 
field when holding Option Case Class
 Key: SPARK-18251
 URL: https://issues.apache.org/jira/browse/SPARK-18251
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
 Environment: OS X
Reporter: Aniket Bhatnagar


I am running into a runtime exception when a DataSet is holding an empty (None) 
instance for an Option type that holds a non-nullable field. For instance, if 
we have the following case class:

case class DataRow(id: Int, value: String)

Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
hold None. If it does so, the following exception is thrown:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: Lost 
task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: Null 
value appeared in non-nullable field:
- field (class: "scala.Int", name: "id")
- option value class: "DataSetOptBug.DataRow"
- root class: "scala.Option"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


I am attaching a sample program that can be used to reproduce this bug.
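
For reference, a minimal sketch that should reproduce this (not the attached 
program; it assumes Spark 2.0.x, a local master, and borrows the DataSetOptBug 
object name from the error message):

{code}
import org.apache.spark.sql.SparkSession

object DataSetOptBug {
  case class DataRow(id: Int, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DataSetOptBug")
      .getOrCreate()
    import spark.implicits._

    val data: Seq[Option[DataRow]] = Seq(Some(DataRow(1, "a")), None)
    // Encoding happens when the plan executes; the None element leaves the
    // non-nullable DataRow.id field null and triggers the RuntimeException.
    val ds = spark.createDataset(spark.sparkContext.parallelize(data))
    println(ds.count())

    spark.stop()
  }
}
{code}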




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop

2015-07-08 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-8895:
---

 Summary: MetricsSystem.removeSource not called in 
StreamingContext.stop
 Key: SPARK-8895
 URL: https://issues.apache.org/jira/browse/SPARK-8895
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar
Priority: Minor


StreamingContext calls env.metricsSystem.registerSource during its construction 
but does not call env.metricsSystem.removeSource. Therefore, if a user attempts 
to restart a Streaming job in the same JVM by creating a new instance of 
StreamingContext with the same application name, it results in exceptions like 
the following in the log:

[info] o.a.s.m.MetricsSystem -Metrics already registered
java.lang.IllegalArgumentException: A metric named 
.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
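
A minimal sketch of the restart scenario (illustrative only, assuming Spark 
1.4.0 and a local master):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MetricsSourceNotRemoved {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("AppName")

    // The first StreamingContext registers its StreamingSource with the
    // MetricsSystem during construction.
    val ssc1 = new StreamingContext(conf, Seconds(1))
    // Stopping only the streaming context keeps the SparkContext (and its
    // MetricsSystem) alive, but the source is never removed.
    ssc1.stop(stopSparkContext = false, stopGracefully = false)

    // Restarting streaming in the same JVM registers a source with the same
    // name again and logs "Metrics already registered".
    val ssc2 = new StreamingContext(ssc1.sparkContext, Seconds(1))
    ssc2.stop(stopSparkContext = true, stopGracefully = false)
  }
}
{code}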




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8896) StreamingSource should choose a unique name

2015-07-08 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-8896:
---

 Summary: StreamingSource should choose a unique name
 Key: SPARK-8896
 URL: https://issues.apache.org/jira/browse/SPARK-8896
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar


If 2 instances of StreamingContext are created and run using the same 
SparkContext, it results in the following exception in the logs and causes the 
latter StreamingContext's metrics to go unreported.

[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
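
A minimal sketch of the scenario (illustrative only, assuming Spark 1.4.0 and a 
local master):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DuplicateStreamingSourceName {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("AppName"))

    // Both StreamingContexts derive their StreamingSource name from the same
    // application name, so the second registration collides and the second
    // context's metrics go unreported ("Metrics already registered").
    val ssc1 = new StreamingContext(sc, Seconds(1))
    val ssc2 = new StreamingContext(sc, Seconds(5))

    ssc2.stop(stopSparkContext = false, stopGracefully = false)
    ssc1.stop(stopSparkContext = true, stopGracefully = false)
  }
}
{code}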




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name

2015-07-08 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-8896:

Description: 
If 2 instances of StreamingContext are created and run using the same 
SparkContext, it results in the following exception in the logs and causes the 
latter StreamingContext's metrics to go unreported.


[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]


  was:
If 2 instances of StreamingContext are created and run using the same 
SparkContext, it results in the following exception in the logs and causes the 
latter StreamingContext's metrics to go unreported.

{quote}
[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
{quote}


 StreamingSource should choose a unique name
 ---

 Key: SPARK-8896
 URL: https://issues.apache.org/jira/browse/SPARK-8896
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar

 If 2 instances of StreamingContext are created and run using the same 
 SparkContext, it results in the following exception in the logs and causes the 
 latter StreamingContext's metrics to go unreported.
 [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
 registered
 java.lang.IllegalArgumentException: A metric named 
 AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
 already exists
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148)
  ~[spark-core_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199)
  [spark-streaming_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
 [spark-streaming_2.11-1.4.0.jar:1.4.0]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop

2015-07-08 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-8895:

Description: 
StreamingContext calls env.metricsSystem.registerSource during its construction 
but does not call env.metricsSystem.removeSource. Therefore, if a user attempts 
to restart a Streaming job in the same JVM by creating a new instance of 
StreamingContext with the same application name, it results in exceptions like 
the following in the log:

??
[info] o.a.s.m.MetricsSystem -Metrics already registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
??

  was:
StreamingContext calls env.metricsSystem.registerSource during its construction 
but does not call env.metricsSystem.removeSource. Therefore, if a user attempts 
to restart a Streaming job in the same JVM by creating a new instance of 
StreamingContext with the same application name, it results in exceptions like 
the following in the log:

{{
[info] o.a.s.m.MetricsSystem -Metrics already registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
}}


 MetricsSystem.removeSource not called in StreamingContext.stop
 --

 Key: SPARK-8895
 URL: https://issues.apache.org/jira/browse/SPARK-8895
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar
Priority: Minor

 StreamingContext calls env.metricsSystem.registerSource during its 
 construction but does not call env.metricsSystem.removeSource. Therefore, if 
 a user attempts to restart a Streaming job in the same JVM by creating a new 
 instance of StreamingContext with the same application name, it results in 
 exceptions like the following in the log:
 ??
 [info] o.a.s.m.MetricsSystem -Metrics already registered
 java.lang.IllegalArgumentException: A metric named 
 AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
 already exists
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148)
  ~[spark-core_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199)
  [spark-streaming_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
 [spark-streaming_2.11-1.4.0.jar:1.4.0]
 ??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name

2015-07-08 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-8896:

Description: 
If 2 instances of StreamingContext are created and run using the same 
SparkContext, it results in the following exception in the logs and causes the 
latter StreamingContext's metrics to go unreported.

{quote}
[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
{quote}

  was:
If 2 instances of StreamingContext are created and run using the same 
SparkContext, it results in the following exception in the logs and causes the 
latter StreamingContext's metrics to go unreported.

[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
registered
java.lang.IllegalArgumentException: A metric named 
AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]



 StreamingSource should choose a unique name
 ---

 Key: SPARK-8896
 URL: https://issues.apache.org/jira/browse/SPARK-8896
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar

 If 2 instances of StreamingContext are created and run using the same 
 SparkContext, it results in the following exception in the logs and causes the 
 latter StreamingContext's metrics to go unreported.
 {quote}
 [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
 registered
 java.lang.IllegalArgumentException: A metric named 
 AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
 already exists
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148)
  ~[spark-core_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199)
  [spark-streaming_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
 [spark-streaming_2.11-1.4.0.jar:1.4.0]
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop

2015-07-08 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-8895:

Description: 
StreamingContext calls env.metricsSystem.registerSource during its construction 
but does not call env.metricsSystem.removeSource. Therefore, if a user attempts 
to restart a Streaming job in the same JVM by creating a new instance of 
StreamingContext with the same application name, it results in exceptions like 
the following in the log:

{quote}
[info] o.a.s.m.MetricsSystem -Metrics already registered
java.lang.IllegalArgumentException: A metric named 
.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
{quote}

  was:
StreamingContext calls env.metricsSystem.registerSource during its construction 
but does not call env.metricsSystem.removeSource. Therefore, if a user attempts 
to restart a Streaming job in the same JVM by creating a new instance of 
StreamingContext with the same application name, it results in exceptions like 
the following in the log:

[info] o.a.s.m.MetricsSystem -Metrics already registered
java.lang.IllegalArgumentException: A metric named 
.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
at 
org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) 
~[spark-core_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]
at 
org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
[spark-streaming_2.11-1.4.0.jar:1.4.0]



 MetricsSystem.removeSource not called in StreamingContext.stop
 --

 Key: SPARK-8895
 URL: https://issues.apache.org/jira/browse/SPARK-8895
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar
Priority: Minor

 StreamingContext calls env.metricsSystem.registerSource during its 
 construction but does not call env.metricsSystem.removeSource. Therefore, if 
 a user attempts to restart a Streaming job in the same JVM by creating a new 
 instance of StreamingContext with the same application name, it results in 
 exceptions like the following in the log:
 {quote}
 [info] o.a.s.m.MetricsSystem -Metrics already registered
 java.lang.IllegalArgumentException: A metric named 
 .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already 
 exists
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148)
  ~[spark-core_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199)
  [spark-streaming_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
 [spark-streaming_2.11-1.4.0.jar:1.4.0]
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8896) StreamingSource should choose a unique name

2015-07-08 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618764#comment-14618764
 ] 

Aniket Bhatnagar commented on SPARK-8896:
-

As per the documentation (Scala docs), only one SparkContext per JVM is 
allowed. However, no such documentation exists for StreamingContext. 
SparkContext makes an effort to ensure only one SparkContext instance exists in 
the JVM using the contextBeingConstructed variable. However, no such effort is 
made in StreamingContext. This led me to believe that multiple 
StreamingContexts are allowed in a JVM.

I can understand why only one SparkContext is allowed per JVM (global state, et 
al), but I don't think that's true for StreamingContext. There may be genuine 
use cases for having 2 StreamingContext instances using the same SparkContext, 
to leverage the same workers with different batch intervals.

 StreamingSource should choose a unique name
 ---

 Key: SPARK-8896
 URL: https://issues.apache.org/jira/browse/SPARK-8896
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Aniket Bhatnagar

 If 2 instances of StreamingContext are created and run using the same 
 SparkContext, it results in the following exception in the logs and causes the 
 latter StreamingContext's metrics to go unreported.
 [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem -Metrics already 
 registered
 java.lang.IllegalArgumentException: A metric named 
 AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime 
 already exists
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85)~[metrics-core-3.1.0.jar:3.1.0]
 at 
 org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148)
  ~[spark-core_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:199)
  [spark-streaming_2.11-1.4.0.jar:1.4.0]
 at 
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:71) 
 [spark-streaming_2.11-1.4.0.jar:1.4.0]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart

2015-05-21 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-7788:
---

 Summary: Streaming | Kinesis | KinesisReceiver blocks in onStart
 Key: SPARK-7788
 URL: https://issues.apache.org/jira/browse/SPARK-7788
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1, 1.3.0
Reporter: Aniket Bhatnagar


KinesisReceiver calls worker.run(), which is a blocking call (a while loop), as 
per the source code of the kinesis-client library - 
https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java.

This results in an infinite loop when calling 
sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true), 
perhaps because ReceiverTracker is never able to register the receiver (its 
receiverInfo field is an empty map), causing it to be stuck in an infinite loop 
while waiting for the running flag to be set to false.

Also, we should investigate a way to have the receiver restart in case of 
failures.
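
For context, the Receiver contract expects onStart() to return quickly and run 
its work on a background thread. A generic sketch of that pattern (illustrative 
only, not KinesisReceiver's actual wiring):

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Runs the blocking loop on its own thread so that onStart() returns
// immediately and the receiver can register with ReceiverTracker.
class NonBlockingReceiver
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_2) {

  @volatile private var workerThread: Thread = _

  override def onStart(): Unit = {
    workerThread = new Thread("receiver-worker") {
      override def run(): Unit = {
        // In KinesisReceiver's case this is where the blocking worker.run()
        // call would live; here we just poll until the receiver is stopped.
        while (!isStopped()) {
          // fetch records from the source and call store(...) here
          Thread.sleep(100)
        }
      }
    }
    workerThread.setDaemon(true)
    workerThread.start()
  }

  override def onStop(): Unit = {
    if (workerThread != null) {
      workerThread.interrupt()
    }
  }
}
{code}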



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11

2015-05-11 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537766#comment-14537766
 ] 

Aniket Bhatnagar commented on SPARK-6154:
-

Yes. I haven't tested this approach with Scala 2.11, but I believe it should 
work. It should also be possible to make autocompletion work if we could write 
an adapter between the old jline reader and the new one.

 Support Kafka, JDBC in Scala 2.11
 -

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25:
 object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165:
 not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11

2015-05-05 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528089#comment-14528089
 ] 

Aniket Bhatnagar commented on SPARK-6154:
-

I ran into this as well, and my fix was to update 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver to use the latest 
jline and comment out the line 'CliDriver.getCommandCompletor.foreach((e) => 
reader.addCompletor(e))'. Yes, you do lose autocompletion, but JDBC 
(HiveThriftServer2) works fine. What should the proper fix for this be? Perhaps 
dropping the autocomplete feature?

 Support Kafka, JDBC in Scala 2.11
 -

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25:
 object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165:
 not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module

2015-02-16 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322518#comment-14322518
 ] 

Aniket Bhatnagar commented on SPARK-3638:
-

Did you build Spark with the kinesis-asl profile? The standard distribution 
does not have this profile, and therefore you would have to roll your own as 
described in 
https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
 (mvn -Pkinesis-asl -DskipTests clean package).

 Commons HTTP client dependency conflict in extras/kinesis-asl module
 

 Key: SPARK-3638
 URL: https://issues.apache.org/jira/browse/SPARK-3638
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: dependencies
 Fix For: 1.1.1, 1.2.0


 Followed instructions as mentioned @ 
 https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
  and when running the example, I get the following error:
 {code}
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99)
 at 
 com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
 at 
 com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
 at 
 com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132)
 {code}
 I believe this is due to the dependency conflict as described @ 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-01-26 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293019#comment-14293019
 ] 

Aniket Bhatnagar commented on SPARK-2243:
-

I am also interested in having this fixed. Can someone please outline the 
specific things that need to be fixed to make this work, so that interested 
people can contribute?

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at 

[jira] [Created] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core :jackson-core:2.3.1 causes compatibility issues

2015-01-19 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-5330:
---

 Summary: Core | Scala 2.11 | Transitive dependency on 
com.fasterxml.jackson.core :jackson-core:2.3.1 causes compatibility issues
 Key: SPARK-5330
 URL: https://issues.apache.org/jira/browse/SPARK-5330
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aniket Bhatnagar


Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. 
Users of jackson-module-scala had to depend on the same version to avoid any 
class compatibility issues. However, for Scala 2.11, jackson-module-scala is 
no longer published for version 2.3.1. Since version 2.3.1 is quite old, 
perhaps we should investigate upgrading to the latest jackson-core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core :jackson-core:2.3.1 causes compatibility issues

2015-01-19 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283432#comment-14283432
 ] 

Aniket Bhatnagar commented on SPARK-5330:
-

One possible workaround is to define the dependency as follows:

"com.fasterxml.jackson.module" % "jackson-module-scala_2.10" % "2.3.1" 
excludeAll(
  ExclusionRule(organization = "org.scala-lang"),
  ExclusionRule(organization = "org.scalatest")
)

 Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core 
 :jackson-core:2.3.1 causes compatibility issues
 ---

 Key: SPARK-5330
 URL: https://issues.apache.org/jira/browse/SPARK-5330
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aniket Bhatnagar

 Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. 
 Users of jackson-module-scala had to depend on the same version to avoid any 
 class compatibility issues. However, for Scala 2.11, jackson-module-scala is 
 no longer published for version 2.3.1. Since version 2.3.1 is quite old, 
 perhaps we should investigate upgrading to the latest jackson-core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-12 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar resolved SPARK-5164.
-
Resolution: Duplicate

Duplicates and has similar findings to SPARK-1825.

 YARN | Spark job submits from windows machine to a linux YARN cluster fail
 --

 Key: SPARK-5164
 URL: https://issues.apache.org/jira/browse/SPARK-5164
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
 Environment: Spark submit from Windows 7
 YARN cluster on CentOS 6.5
Reporter: Aniket Bhatnagar

 While submitting spark jobs from a windows machine to a linux YARN cluster, 
 the jobs fail because of the following reasons:
 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
 etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead 
 of linux's syntax ($JAVA_HOME, $PWD, etc).
 2. Paths in launch environment are delimited by semi-colon instead of colon. 
 This is because of usage of File.pathSeparator in YarnSparkHadoopUtil.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-08 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-5164:
---

 Summary: YARN | Spark job submits from windows machine to a linux 
YARN cluster fail
 Key: SPARK-5164
 URL: https://issues.apache.org/jira/browse/SPARK-5164
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
 Environment: Spark submit from Windows 7
YARN cluster on CentOS 6.5
Reporter: Aniket Bhatnagar


While submitting spark jobs from a windows machine to a linux YARN cluster, the 
jobs fail because of the following reasons:

1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of 
linux's syntax ($JAVA_HOME, $PWD, etc).
2. Paths in launch environment are delimited by semi-colon instead of colon. 
This is because of usage of Path.SEPARATOR in ClientBase and 
YarnSparkHadoopUtil.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-08 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-5164:

Description: 
While submitting spark jobs from a windows machine to a linux YARN cluster, the 
jobs fail because of the following reasons:

1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of 
linux's syntax ($JAVA_HOME, $PWD, etc).
2. Paths in launch environment are delimited by semi-colon instead of colon. 
This is because of usage of File.pathSeparator in ClientBase and 
YarnSparkHadoopUtil.

  was:
While submitting spark jobs from a windows machine to a linux YARN cluster, the 
jobs fail because of the following reasons:

1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of 
linux's syntax ($JAVA_HOME, $PWD, etc).
2. Paths in launch environment are delimited by semi-colon instead of colon. 
This is because of usage of Path.SEPARATOR in ClientBase and 
YarnSparkHadoopUtil.


 YARN | Spark job submits from windows machine to a linux YARN cluster fail
 --

 Key: SPARK-5164
 URL: https://issues.apache.org/jira/browse/SPARK-5164
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
 Environment: Spark submit from Windows 7
 YARN cluster on CentOS 6.5
Reporter: Aniket Bhatnagar

 While submitting spark jobs from a windows machine to a linux YARN cluster, 
 the jobs fail because of the following reasons:
 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
 etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead 
 of linux's syntax ($JAVA_HOME, $PWD, etc).
 2. Paths in launch environment are delimited by semi-colon instead of colon. 
 This is because of usage of File.pathSeparator in ClientBase and 
 YarnSparkHadoopUtil.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-08 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-5164:

Description: 
While submitting spark jobs from a windows machine to a linux YARN cluster, the 
jobs fail because of the following reasons:

1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of 
linux's syntax ($JAVA_HOME, $PWD, etc).
2. Paths in launch environment are delimited by semi-colon instead of colon. 
This is because of usage of File.pathSeparator in YarnSparkHadoopUtil.

  was:
While submitting spark jobs from a windows machine to a linux YARN cluster, the 
jobs fail because of the following reasons:

1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead of 
linux's syntax ($JAVA_HOME, $PWD, etc).
2. Paths in launch environment are delimited by semi-colon instead of colon. 
This is because of usage of File.pathSeparator in ClientBase and 
YarnSparkHadoopUtil.


 YARN | Spark job submits from windows machine to a linux YARN cluster fail
 --

 Key: SPARK-5164
 URL: https://issues.apache.org/jira/browse/SPARK-5164
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
 Environment: Spark submit from Windows 7
 YARN cluster on CentOS 6.5
Reporter: Aniket Bhatnagar

 While submitting spark jobs from a windows machine to a linux YARN cluster, 
 the jobs fail because of the following reasons:
 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
 etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead 
 of linux's syntax ($JAVA_HOME, $PWD, etc).
 2. Paths in launch environment are delimited by semi-colon instead of colon. 
 This is because of usage of File.pathSeparator in YarnSparkHadoopUtil.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-08 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270533#comment-14270533
 ] 

Aniket Bhatnagar commented on SPARK-5164:
-

The first issue can be fixed by using Environment.variable.$$() instead of 
Environment.variable.$() in ClientBase. But unfortunately, the $$() method 
seems to have been added only in recent versions of Hadoop, making it not a 
viable option if we want to support many versions of Hadoop.

I am not sure if it is possible to detect the remote OS using the YARN API. I 
am thinking that perhaps we should introduce a new configuration, 
spark.yarn.remote.os, that hints at the target YARN OS and can take the values 
Windows or Linux. We can then use this configuration in ClientBase and wherever 
Path.SEPARATOR is used. I am happy to submit a pull request for this, once the 
recommendation is vetted by the community.
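
To illustrate the proposal (spark.yarn.remote.os is a hypothetical setting, not 
an existing Spark configuration, and these helper names are made up), the 
client-side logic could look roughly like this:

{code}
// Hypothetical helpers keyed off a proposed spark.yarn.remote.os value;
// not part of Spark, just a sketch of the idea described above.
object RemoteOsHelpers {
  def envVarRef(name: String, remoteOs: String): String =
    if (remoteOs.equalsIgnoreCase("windows")) s"%$name%" else "$" + name

  def classPathSeparator(remoteOs: String): String =
    if (remoteOs.equalsIgnoreCase("windows")) ";" else ":"
}

// e.g. RemoteOsHelpers.envVarRef("JAVA_HOME", "linux") returns "$JAVA_HOME"
{code}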

 YARN | Spark job submits from windows machine to a linux YARN cluster fail
 --

 Key: SPARK-5164
 URL: https://issues.apache.org/jira/browse/SPARK-5164
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
 Environment: Spark submit from Windows 7
 YARN cluster on CentOS 6.5
Reporter: Aniket Bhatnagar

 While submitting spark jobs from a windows machine to a linux YARN cluster, 
 the jobs fail because of the following reasons:
 1. Commands and classpath contain environment variables (like JAVA_HOME, PWD, 
 etc) but are added as per windows's syntax (%JAVA_HOME%, %PWD%, etc) instead 
 of linux's syntax ($JAVA_HOME, $PWD, etc).
 2. Paths in launch environment are delimited by semi-colon instead of colon. 
 This is because of usage of Path.SEPARATOR in ClientBase and 
 YarnSparkHadoopUtil.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5143) spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-08 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-5143:
---

 Summary: spark-network-yarn 2.11 depends on spark-network-shuffle 
2.10
 Key: SPARK-5143
 URL: https://issues.apache.org/jira/browse/SPARK-5143
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Aniket Bhatnagar
Priority: Critical


It seems that spark-network-yarn compiled for Scala 2.11 depends on 
spark-network-shuffle compiled for Scala 2.10. This causes builds with SBT 
0.13.7 to fail with the error "Conflicting cross-version suffixes".

Screenshot of dependency: 
http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR
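
Until the published POM is fixed, a possible user-side workaround in build.sbt 
(coordinates and version are illustrative) is to exclude the transitive 2.10 
artifact and add the 2.11 one explicitly:
{code}
libraryDependencies ++= Seq(
  ("org.apache.spark" % "spark-network-yarn_2.11" % "1.2.0")
    .exclude("org.apache.spark", "spark-network-shuffle_2.10"),
  "org.apache.spark" % "spark-network-shuffle_2.11" % "1.2.0"
)
{code}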



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5144) spark-yarn module should be published

2015-01-08 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-5144:
---

 Summary: spark-yarn module should be published
 Key: SPARK-5144
 URL: https://issues.apache.org/jira/browse/SPARK-5144
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Aniket Bhatnagar


We disabled publishing of certain modules in SPARK-3452. One such module is 
spark-yarn. This breaks applications that submit Spark jobs programmatically 
with the master set to yarn-client, because SparkContext depends on classes 
from the spark-yarn module to submit the YARN application.
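
A minimal sketch of the kind of programmatic submission that breaks (the 
application and object names are made up): nothing here references spark-yarn 
directly, yet SparkContext needs it on the runtime classpath for yarn-client 
mode.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedSubmitter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("embedded-job-server")
      .setMaster("yarn-client")
    val sc = new SparkContext(conf)  // fails as below when spark-yarn is missing at runtime
    sc.stop()
  }
}
{code}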

Here is the stack trace that you get if you submit the Spark job without the 
spark-yarn dependency:

2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - 
MemoryStore started with capacity 731.7 MB
Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:232)
at com.myimpl.Server:23)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
at scala.util.Try$.apply(Try.scala:191)
at scala.util.Success.map(Try.scala:236)
at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
at scala.util.Try$.apply(Try.scala:191)
at scala.util.Success.map(Try.scala:236)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:194)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
... 27 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
... 29 more




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2015-01-08 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269145#comment-14269145
 ] 

Aniket Bhatnagar commented on SPARK-3452:
-

I have opened another defect, SPARK-5144, suggesting that the spark-yarn 
module be published.

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2015-01-07 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267469#comment-14267469
 ] 

Aniket Bhatnagar commented on SPARK-3452:
-

Here is the exception I am getting while triggering a job that creates a 
SparkContext with the master set to yarn-client. A quick look at the 1.2.0 
source code suggests I should depend on the spark-yarn module, which I can't 
as it is no longer published. Do you want me to log a separate defect for this 
and submit an appropriate pull request?

2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - 
MemoryStore started with capacity 731.7 MB
Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:232)
at com.myimpl.Server:23)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
at scala.util.Try$.apply(Try.scala:191)
at scala.util.Success.map(Try.scala:236)
at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
at scala.util.Try$.apply(Try.scala:191)
at scala.util.Success.map(Try.scala:236)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
at 
org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:194)
at 
org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
... 27 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
... 29 more


 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2015-01-06 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266156#comment-14266156
 ] 

Aniket Bhatnagar commented on SPARK-3452:
-

Ok, I'll test this out by adding a dependency on spark-network-yarn and see 
how it goes. Fingers crossed!

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2015-01-05 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265749#comment-14265749
 ] 

Aniket Bhatnagar commented on SPARK-3452:
-

I would like this to be revisited. The issue I am facing is that while people 
may not depend on some modules at compile time, they may depend on them at 
runtime. For example, I am building a Spark server that lets users submit 
Spark jobs using convenient REST endpoints. This used to work great even in 
yarn-client mode. However, once I migrate to 1.2.0, this breaks because I can 
no longer add a dependency from my Spark server to the spark-yarn module, 
which is used while submitting jobs to the YARN cluster.

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4862) Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error

2014-12-16 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-4862:
---

 Summary: Streaming | Setting checkpoint as a local directory 
results in Checkpoint RDD has different partitions error
 Key: SPARK-4862
 URL: https://issues.apache.org/jira/browse/SPARK-4862
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.1, 1.1.0
Reporter: Aniket Bhatnagar
Priority: Minor


If the checkpoint directory is set to a local filesystem path, it results in 
confusing error messages like the following:

org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[467] at apply at 
List.scala:318(0) has different number of partitions than original RDD 
MapPartitionsRDD[461] at mapPartitions at StateDStream.scala:71(56)

It would be great if Spark could output a better error message that hints at 
what could have gone wrong.
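
For reference, a minimal sketch of the configuration that avoids the error 
(paths are placeholders): the checkpoint directory has to live on a filesystem 
visible to every executor, such as HDFS, rather than on the driver's local 
disk.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("checkpoint-demo"), Seconds(10))
// ssc.checkpoint("/tmp/checkpoints")                    // local path: leads to the error above on a cluster
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints") // shared path
{code}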



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-12-14 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar closed SPARK-3640.
---
Resolution: Not a Problem

Tested, and Chris's suggestion of using an EC2 IAM instance profile works fine.

 KinesisUtils should accept a credentials object instead of forcing 
 DefaultCredentialsProvider
 -

 Key: SPARK-3640
 URL: https://issues.apache.org/jira/browse/SPARK-3640
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: kinesis

 KinesisUtils should accept AWS Credentials as a parameter and should default 
 to DefaultCredentialsProvider if no credentials are provided. Currently, the 
 implementation forces usage of DefaultCredentialsProvider which can be a pain 
 especially when jobs are run by multiple  unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1350) YARN ContainerLaunchContext should use cluster's JAVA_HOME

2014-12-08 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239060#comment-14239060
 ] 

Aniket Bhatnagar commented on SPARK-1350:
-

I am using Hadoop 2.5.0 (CDH). Agreed that it handles the Windows case. But 
the use case I am talking about is when SparkContext is created 
programmatically on a Windows machine and is used to submit jobs to a YARN 
cluster running on Linux. As per the above code, %JAVA_HOME%/bin/java will be 
generated as one of the commands by ClientBase and submitted to the YARN 
cluster. This will obviously fail when YARN tries to execute the container, as 
% is treated differently on Linux.
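
A rough illustration of the mismatch (simplified; Environment here is 
org.apache.hadoop.yarn.api.ApplicationConstants.Environment and, as far as I 
understand, $() picks the expansion syntax of the JVM it runs in, i.e. the 
client):
{code}
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

val javaCommand = Environment.JAVA_HOME.$() + "/bin/java"
// Built on a Windows client: %JAVA_HOME%/bin/java  -> not expandable by a Linux NodeManager
// Built on a Linux client:   $JAVA_HOME/bin/java
{code}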

 YARN ContainerLaunchContext should use cluster's JAVA_HOME
 --

 Key: SPARK-1350
 URL: https://issues.apache.org/jira/browse/SPARK-1350
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.0.0


 {code}
 var javaCommand = "java"
 val javaHome = System.getenv("JAVA_HOME")
 if ((javaHome != null && !javaHome.isEmpty()) || env.isDefinedAt("JAVA_HOME")) {
   javaCommand = Environment.JAVA_HOME.$() + "/bin/java"
 }
 {code}
 Currently, if JAVA_HOME is specified on the client, it will be used instead 
 of the value given on the cluster.  This makes it so that Java must be 
 installed in the same place on the client as on the cluster.
 This is a possibly incompatible change that we should get in before 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1350) YARN ContainerLaunchContext should use cluster's JAVA_HOME

2014-12-07 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237525#comment-14237525
 ] 

Aniket Bhatnagar commented on SPARK-1350:
-

[~sandyr] Using Environment.JAVA_HOME.$() causes issues when submitting Spark 
applications from a Windows box to a YARN cluster running on Linux (with the 
Spark master set to yarn-client). This is because Environment.JAVA_HOME.$() 
resolves to %JAVA_HOME%, which is not a valid executable path on Linux. Is 
this a Spark issue or a YARN issue?

 YARN ContainerLaunchContext should use cluster's JAVA_HOME
 --

 Key: SPARK-1350
 URL: https://issues.apache.org/jira/browse/SPARK-1350
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.0.0


 {code}
 var javaCommand = "java"
 val javaHome = System.getenv("JAVA_HOME")
 if ((javaHome != null && !javaHome.isEmpty()) || env.isDefinedAt("JAVA_HOME")) {
   javaCommand = Environment.JAVA_HOME.$() + "/bin/java"
 }
 {code}
 Currently, if JAVA_HOME is specified on the client, it will be used instead 
 of the value given on the cluster.  This makes it so that Java must be 
 installed in the same place on the client as on the cluster.
 This is a possibly incompatible change that we should get in before 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module

2014-12-03 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232719#comment-14232719
 ] 

Aniket Bhatnagar commented on SPARK-3638:
-

Yes. You may want to open another JIRA ticket for having Kinesis pre-built 
packages on the Spark download page.

 Commons HTTP client dependency conflict in extras/kinesis-asl module
 

 Key: SPARK-3638
 URL: https://issues.apache.org/jira/browse/SPARK-3638
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: dependencies
 Fix For: 1.1.1, 1.2.0


 Followed instructions as mentioned @ 
 https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
  and when running the example, I get the following error:
 {code}
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99)
 at 
 com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
 at 
 com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
 at 
 com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132)
 {code}
 I believe this is due to the dependency conflict as described @ 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module

2014-12-01 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231086#comment-14231086
 ] 

Aniket Bhatnagar commented on SPARK-3638:
-

[~ashrafuzzaman] did you build using the kinesis-asl profile as described in 
https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
 (mvn -Pkinesis-asl -DskipTests clean package)? None of the pre-built Spark 
distributions have the profile enabled. I just pulled down the 1.1.1 source 
code and did a build as described in the Kinesis integration documentation, 
and I can see the class in the built assembly.

 Commons HTTP client dependency conflict in extras/kinesis-asl module
 

 Key: SPARK-3638
 URL: https://issues.apache.org/jira/browse/SPARK-3638
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: dependencies
 Fix For: 1.1.1, 1.2.0


 Followed instructions as mentioned @ 
 https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
  and when running the example, I get the following error:
 {code}
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99)
 at 
 com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
 at 
 com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
 at 
 com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132)
 {code}
 I believe this is due to the dependency conflict as described @ 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API

2014-11-19 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218063#comment-14218063
 ] 

Aniket Bhatnagar commented on SPARK-2321:
-

Just a quick question: will the API be usable from the job submitter process 
in yarn-cluster mode (i.e. when the driver is running as a separate YARN 
process)?

 Design a proper progress reporting & event listener API
 ---

 Key: SPARK-2321
 URL: https://issues.apache.org/jira/browse/SPARK-2321
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Critical
 Fix For: 1.2.0


 This is a ticket to track progress on redesigning the SparkListener and 
 JobProgressListener API.
 There are multiple problems with the current design, including:
 0. I'm not sure if the API is usable in Java (there are at least some enums 
 we used in Scala and a bunch of case classes that might complicate things).
 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
 attention to it yet. Something as important as progress reporting deserves a 
 more stable API.
 2. There is no easy way to connect jobs with stages. Similarly, there is no 
 easy way to connect job groups with jobs / stages.
 3. JobProgressListener itself has no encapsulation at all. States can be 
 arbitrarily mutated by external programs. Variable names are sort of randomly 
 decided and inconsistent. 
 We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4473) [Core] StageInfo should have ActiveJob's group ID as a field

2014-11-18 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-4473:
---

 Summary: [Core] StageInfo should have ActiveJob's group ID as a 
field
 Key: SPARK-4473
 URL: https://issues.apache.org/jira/browse/SPARK-4473
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aniket Bhatnagar
Priority: Minor


It would be convenient to have the active job's group ID in StageInfo so that 
JobProgressListener can be used to track a specific job or jobs in a group.

Perhaps the stage's group ID could also be shown in the default Spark UI.
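
For context, a sketch of the workaround needed today (a custom listener; 
"spark.jobGroup.id" is the local property set by SparkContext.setJobGroup): 
stages can only be tied to a group indirectly, by remembering the job-to-group 
mapping at job start.
{code}
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageSubmitted}

class GroupAwareListener extends SparkListener {
  private val stageToGroup = mutable.Map[Int, String]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id")).orNull
    jobStart.stageIds.foreach(id => stageToGroup(id) = group)
  }

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    // If StageInfo carried the group ID, this extra bookkeeping would be unnecessary.
    val group = stageToGroup.get(stageSubmitted.stageInfo.stageId)
    println(s"Stage ${stageSubmitted.stageInfo.stageId} belongs to group $group")
  }
}
{code}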



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-11-04 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196534#comment-14196534
 ] 

Aniket Bhatnagar commented on SPARK-3640:
-

Thanks Chris for looking into this. This documentation would certainly be 
useful. However, the problem that I am facing with using 
DefaultCredentialsProvider is that each node in the cluster needs to be set up 
to have those credentials in the user's home directory, and this is a bit 
tedious. I would like the driver to pass credentials to all nodes in the 
cluster to avoid such operational overhead. I have submitted a pull request 
that contains the changes I had to make to allow the driver to pass 
user-defined credentials. Do have a look and let me know if there is a better 
way.

https://github.com/apache/spark/pull/3092

 KinesisUtils should accept a credentials object instead of forcing 
 DefaultCredentialsProvider
 -

 Key: SPARK-3640
 URL: https://issues.apache.org/jira/browse/SPARK-3640
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: kinesis

 KinesisUtils should accept AWS Credentials as a parameter and should default 
 to DefaultCredentialsProvider if no credentials are provided. Currently, the 
 implementation forces usage of DefaultCredentialsProvider which can be a pain 
 especially when jobs are run by multiple  unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module

2014-09-22 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-3638:
---

 Summary: Commons HTTP client dependency conflict in 
extras/kinesis-asl module
 Key: SPARK-3638
 URL: https://issues.apache.org/jira/browse/SPARK-3638
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar


I followed the instructions mentioned at 
https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
 and when running the example, I get the following error:

Caused by: java.lang.NoSuchMethodError: 
org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at 
org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
at 
org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
at 
org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
at 
com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
at 
com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
at 
com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at 
com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at 
com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
at 
com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
at 
com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)


I believe this is due to the dependency conflict described at 
http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E
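
If it helps, one possible user-side workaround (the version number is 
illustrative, and I have not verified it against every AWS SDK release) is to 
force a version of httpclient that provides the 
DefaultClientConnectionOperator(SchemeRegistry, DnsResolver) constructor the 
AWS SDK expects, e.g. in build.sbt:
{code}
libraryDependencies += ("org.apache.httpcomponents" % "httpclient" % "4.2.6").force()
{code}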





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3639) Kinesis examples set master as local

2014-09-22 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-3639:
---

 Summary: Kinesis examples set master as local
 Key: SPARK-3639
 URL: https://issues.apache.org/jira/browse/SPARK-3639
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.1.0, 1.0.2
Reporter: Aniket Bhatnagar
Priority: Minor


Kinesis examples set the master as local, thus not allowing the examples to be 
tested on a cluster.
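
A minimal sketch of the usual fix (not the actual example code; the app name 
is a placeholder): let spark-submit supply the master instead of hard-coding 
it in the example.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("KinesisWordCount")
// .setMaster("local[2]")  // hard-coding the master prevents running on a cluster
val ssc = new StreamingContext(sparkConf, Seconds(2))
{code}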



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module

2014-09-22 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar updated SPARK-3638:

Component/s: Streaming

 Commons HTTP client dependency conflict in extras/kinesis-asl module
 

 Key: SPARK-3638
 URL: https://issues.apache.org/jira/browse/SPARK-3638
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: dependencies

 Followed instructions as mentioned @ 
 https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
  and when running the example, I get the following error:
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114)
 at 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99)
 at 
 com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
 at 
 com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
 at 
 com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119)
 at 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117)
 at 
 com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132)
 I believe this is due to the dependency conflict as described @ 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-09-22 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143108#comment-14143108
 ] 

Aniket Bhatnagar commented on SPARK-3639:
-

If the community agrees this is an issue, I can submit a PR for this.

 Kinesis examples set master as local
 

 Key: SPARK-3639
 URL: https://issues.apache.org/jira/browse/SPARK-3639
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Aniket Bhatnagar
Priority: Minor
  Labels: examples

 Kinesis examples set master as local thus not allowing the example to be 
 tested on a cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-09-22 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-3640:
---

 Summary: KinesisUtils should accept a credentials object instead 
of forcing DefaultCredentialsProvider
 Key: SPARK-3640
 URL: https://issues.apache.org/jira/browse/SPARK-3640
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar


KinesisUtils should accept AWS credentials as a parameter and should default to 
DefaultCredentialsProvider if no credentials are provided. Currently, the 
implementation forces usage of DefaultCredentialsProvider, which can be a pain, 
especially when jobs are run by multiple Unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-09-22 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143111#comment-14143111
 ] 

Aniket Bhatnagar commented on SPARK-3640:
-

I understand that the credentials need to be serializable so that they can be 
transmitted to workers. I can put in a pull request to make this possible.
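
For illustration, a sketch of one way to make that possible (the class below 
is hypothetical, not the actual API; com.amazonaws.auth.AWSCredentials is the 
standard AWS SDK interface):
{code}
import com.amazonaws.auth.AWSCredentials

// A serializable credentials holder that the driver could ship to workers;
// KinesisUtils could fall back to the default provider chain when it is absent.
case class SerializableAWSCredentials(accessKeyId: String, secretKey: String)
  extends AWSCredentials with Serializable {
  override def getAWSAccessKeyId: String = accessKeyId
  override def getAWSSecretKey: String = secretKey
}
{code}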

 KinesisUtils should accept a credentials object instead of forcing 
 DefaultCredentialsProvider
 -

 Key: SPARK-3640
 URL: https://issues.apache.org/jira/browse/SPARK-3640
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: kinesis

 KinesisUtils should accept AWS Credentials as a parameter and should default 
 to DefaultCredentialsProvider if no credentials are provided. Currently, the 
 implementation forces usage of DefaultCredentialsProvider which can be a pain 
 especially when jobs are run by multiple  unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org