[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Selin updated SPARK-12089:
---
Description: 
When running a large spark sql query including multiple joins I see tasks 
failing with the following trace:

{code}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

From the Spark code it looks like this is due to an integer overflow when 
growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
following in the version I'm running:

{code}
final byte[] tmp = new byte[length * 2];
{code}

This seems to indicate to me that this buffer will never be able to hold more 
than 2 GB worth of data, and likely even less, since any length > 
1073741824 will cause an integer overflow and turn the new buffer size negative.

I hope I'm simply missing some critical config setting, but it still seems weird 
that we have a (rather low) upper limit on these buffers. 
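For illustration, a minimal Scala sketch of the overflow described above (the concrete value is arbitrary):

{code}
// Once the current buffer length exceeds 2^30 (1073741824), doubling it
// wraps around Int.MaxValue and the requested array size goes negative.
val length = 1200000000              // larger than 2^30
val doubled = length * 2             // -1894967296 after 32-bit overflow
// new Array[Byte](doubled)          // would throw NegativeArraySizeException
{code}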

  was:
When running a large spark sql query including multiple joins I see tasks 
failing with the following trace:

{code}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

From the Spark code it looks like this is due to an integer overflow when 
growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
following in the version I'm running:

{code}
final byte[] tmp = new byte[length * 2];
{code}

This seems to indicate to me that this buffer will never be able to hold more 
than 2 GB worth of data, and likely even less, since any length > 
1073741824 will cause an integer overflow and turn negative once we double it.
I'm still digging down to try to pinpoint what is actually responsible for 
managing how big this should be grown. I figure we cannot simply add some check 
here to keep it from going negative, since arguably we do need the buffer to 
grow this big?


> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: 

[jira] [Comment Edited] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036015#comment-15036015
 ] 

Erik Selin edited comment on SPARK-12089 at 12/2/15 4:15 PM:
-

I can make that change if it is that easy. I'm just wondering if that's really 
enough? What would happen when we need a buffer to hold more than Integer.Max?


was (Author: tyro89):
I can make that change if it is that easy. I'm just wondering if that's really 
enough? What would happen if we actually need a buffer to hold more than 
Integer.Max?

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036015#comment-15036015
 ] 

Erik Selin edited comment on SPARK-12089 at 12/2/15 4:07 PM:
-

I can make that change if it is that easy. I'm just wondering if that's really 
enough? What would happen if we actually need a buffer to hold more than 
Integer.Max?


was (Author: tyro89):
I can make that change if it is that easy. I'm just wondering if that's really 
enough? What would happen if we actually need this buffer to hold more than 
Integer.Max?

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12096) remove the old constraint in word2vec

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12096:


Assignee: Apache Spark

> remove the old constraint in word2vec
> -
>
> Key: SPARK-12096
> URL: https://issues.apache.org/jira/browse/SPARK-12096
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> word2vec can now handle a much bigger vocabulary. 
> The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036015#comment-15036015
 ] 

Erik Selin commented on SPARK-12089:


I can make that change if it is that easy. I'm just wondering if that's really 
enough? What would happen if we actually need this buffer to hold more than 
Integer.Max?

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036004#comment-15036004
 ] 

Wenchen Fan commented on SPARK-12089:
-

yea, I think we should be more conservative when growing the buffer array. Maybe 
we can check the `length`: if `length > Integer.Max / 2`, we should just use 
`Integer.Max` as the new array size instead of `length * 2`.  cc [~davies]
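A minimal sketch of that capped growth (illustrative only, not the actual BufferHolder code):

{code}
// Cap the doubled size at Int.MaxValue instead of letting it wrap negative.
def newBufferSize(length: Int): Int =
  if (length > Int.MaxValue / 2) Int.MaxValue
  else length * 2
{code}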

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2015-12-02 Thread Christoph Pirkl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035873#comment-15035873
 ] 

Christoph Pirkl commented on SPARK-10969:
-

While this commit is useful, it does not fix this issue. In order to fix this, 
it would be best to introduce a serializable parameter object that contains AWS 
credentials for Kinesis, DynamoDB and CloudWatch (and maybe other values). This 
would make it easier to add more parameters later.
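A rough sketch of such a parameter object (a hypothetical shape; the names are illustrative only):

{code}
// Hypothetical serializable holder for per-service AWS credentials.
case class KinesisStreamCredentials(
    kinesisAccessKeyId: String,
    kinesisSecretKey: String,
    dynamoDBAccessKeyId: String,
    dynamoDBSecretKey: String,
    cloudWatchAccessKeyId: String,
    cloudWatchSecretKey: String) extends Serializable
{code}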

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2015-12-02 Thread Brian London (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035963#comment-15035963
 ] 

Brian London commented on SPARK-10969:
--

I'm not sure why it needs to be an object as opposed to two strings, or is the 
issue that this change only adds custom credentials to Kinesis and not DynamoDB 
and CloudWatch?  There are already the `fs.s3.awsAccessKeyId` and 
`fs.s3.awsSecretAccessKey` parameters that are used for the `saveAsTextFile` 
RDD method.  Perhaps something like that could be used for the other services 
as well.
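For reference, this is roughly how the S3 parameters mentioned above are set today (real Hadoop configuration keys; the values are placeholders and {{sc}} is an existing SparkContext):

{code}
// Set the S3 credentials via the Hadoop configuration on the SparkContext.
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "...")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "...")
{code}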

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2015-12-02 Thread Brian London (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035963#comment-15035963
 ] 

Brian London edited comment on SPARK-10969 at 12/2/15 3:31 PM:
---

I'm not sure why it needs to be an object as opposed to two strings, or is the 
issue that this change only adds custom credentials to Kinesis and not DynamoDB 
and CloudWatch?

There are already the `fs.s3.awsAccessKeyId` and `fs.s3.awsSecretAccessKey` 
parameters that are used for the `saveAsTextFile` RDD method.  Perhaps 
something like that could be used for the other services as well.


was (Author: brianlondon):
I'm not sure why it needs to be an object as opposed to two strings, or is the 
issue that this change only adds custom credentials to Kinesis and not DynamoDB 
and CloudWatch?  There are already the `fs.s3.awsAccessKeyId` and 
`fs.s3.awsSecretAccessKey` parameters that are used for the `saveAsTextFile` 
RDD method.  Perhaps something like that could be used for the other services 
as well.

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12098) Cross validator with multi-arm bandit search

2015-12-02 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-12098:
-

 Summary: Cross validator with multi-arm bandit search
 Key: SPARK-12098
 URL: https://issues.apache.org/jira/browse/SPARK-12098
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xusen Yin


The classic cross-validation requires all inner classifiers to iterate to a fixed 
number of iterations, or until convergence. It is costly, especially in the 
massive-data scenario. According to the paper Non-stochastic Best Arm 
Identification and Hyperparameter Optimization 
(http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce 
the total number of iterations of cross-validation with multi-armed bandit 
search.

The multi-armed bandit search for cross-validation (bandit search for short) 
requires warm-starting of ML algorithms, and fine-grained control of the inner 
behavior of the cross validator.

Since there are a bunch of bandit search algorithms for finding the best parameter 
set, we intend to provide only a few of them in the beginning to reduce the 
test/perf-test work and make it more stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12096) remove the old constraint in word2vec

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12096:


Assignee: (was: Apache Spark)

> remove the old constraint in word2vec
> -
>
> Key: SPARK-12096
> URL: https://issues.apache.org/jira/browse/SPARK-12096
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: yuhao yang
>Priority: Minor
>
> word2vec can now handle a much bigger vocabulary. 
> The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12096) remove the old constraint in word2vec

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035864#comment-15035864
 ] 

Apache Spark commented on SPARK-12096:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/10103

> remove the old constraint in word2vec
> -
>
> Key: SPARK-12096
> URL: https://issues.apache.org/jira/browse/SPARK-12096
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: yuhao yang
>Priority: Minor
>
> word2vec can now handle a much bigger vocabulary. 
> The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Selin updated SPARK-12089:
---
Description: 
When running a large spark sql query including multiple joins I see tasks 
failing with the following trace:

{code}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

From the Spark code it looks like this is due to an integer overflow when 
growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
following in the version I'm running:

{code}
final byte[] tmp = new byte[length * 2];
{code}

This seems to indicate to me that this buffer will never be able to hold more 
than 2 GB worth of data, and likely even less, since any length > 
1073741824 will cause an integer overflow and turn negative once we double it.
I'm still digging down to try to pinpoint what is actually responsible for 
managing how big this should be grown. I figure we cannot simply add some check 
here to keep it from going negative, since arguably we do need the buffer to 
grow this big?

  was:
When running a rather large spark sql query including multiple joins I see 
tasks failing with the following trace:

{code}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
at 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

From the code it looks like this is due to a doubling of a target length that 
in my case makes the new buffer length flip to negative due to what I assume 
is just too much data?

The offending line {{BufferHolder.java:36}} is the following in the version I'm 
running:

{code}
final byte[] tmp = new byte[length * 2];
{code}

I'm still digging down to try to pinpoint what is actually responsible for 
managing how big this should be grown. I figure we cannot simply add some check 
here to keep it from going negative, since arguably we do need the buffer to 
grow this big?


> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: 

[jira] [Created] (SPARK-12097) How to do a cached, batched JDBC-lookup in Spark Streaming?

2015-12-02 Thread Christian Kurz (JIRA)
Christian Kurz created SPARK-12097:
--

 Summary: How to do a cached, batched JDBC-lookup in Spark 
Streaming?
 Key: SPARK-12097
 URL: https://issues.apache.org/jira/browse/SPARK-12097
 Project: Spark
  Issue Type: Brainstorming
  Components: Streaming
Reporter: Christian Kurz


h3. Use-case
I need to enrich incoming Kafka data with data from a lookup table (or query) 
on a relational database. Lookup data is changing slowly over time (so caching 
is okay for a certain retention time). Lookup data is potentially huge (so 
loading all data upfront is not an option).

h3. Problem
The overall design idea is to implement a cached and batched JDBC lookup. That 
is, for any lookup keys that are missing from the lookup cache, a JDBC lookup 
is done to retrieve the missing lookup data. JDBC lookups are rather expensive 
(connection overhead, number of round-trips) and therefore must be done in 
batches, e.g. one JDBC lookup per 100 missing keys.

So the high-level logic might look something like this:

# For every Kafka RDD we extract all lookup keys
# For all lookup keys we check whether the lookup data is already available 
in cache and whether this cached information has not expired yet.
# For any lookup keys not found in cache (or expired), we send batched prepared 
JDBC Statements to the database to fetch the missing lookup data:
{{SELECT c1, c2, c3 FROM ... WHERE k1 in (?,?,?,...)}}
to minimize the number of JDBC round-trips.
# At this point we have up-to-date lookup data for all lookup keys and can 
perform the actual lookup operation.
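A minimal sketch (Scala, plain JDBC) of the batched lookup in step 3; the JDBC URL, table and column names are placeholders:

{code}
import java.sql.DriverManager

// One batched round-trip for all keys missing from the cache.
def lookupMissing(missingKeys: Seq[String]): Map[String, (String, String, String)] = {
  if (missingKeys.isEmpty) return Map.empty
  val placeholders = missingKeys.map(_ => "?").mkString(",")
  val conn = DriverManager.getConnection("jdbc:...")   // placeholder URL
  try {
    val stmt = conn.prepareStatement(
      s"SELECT k1, c1, c2, c3 FROM lookup_table WHERE k1 IN ($placeholders)")
    missingKeys.zipWithIndex.foreach { case (k, i) => stmt.setString(i + 1, k) }
    val rs = stmt.executeQuery()
    val found = scala.collection.mutable.Map.empty[String, (String, String, String)]
    while (rs.next()) {
      found(rs.getString("k1")) =
        (rs.getString("c1"), rs.getString("c2"), rs.getString("c3"))
    }
    found.toMap
  } finally {
    conn.close()
  }
}
{code}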

Does this approach make sense on Spark? Would Spark State DStreams be the right 
way to go? Or other design approaches?

Assuming Spark State DStreams are the right direction, the low-level question 
is how to do the batching?

Would this particular signature of DStream.updateStateByKey (iterator -> 
iterator):

{code:borderStyle=solid}
def updateStateByKey[S: ClassTag](
  updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
  partitioner: Partitioner,
  rememberPartitioner: Boolean,
  initialRDD: RDD[(K, S)]
)
{code}

be the right way to batch multiple incoming keys into a single JDBC-lookup 
query?
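Assuming that signature, a sketch of how the iterator form allows batching one JDBC round-trip per partition (K, V and S are taken to be strings here, and {{lookupMissing}} is the helper sketched earlier):

{code}
// Sketch only: gather this partition's keys, fetch the missing ones in one
// batched JDBC query, then emit the new per-key state.
val updateFunc = (iter: Iterator[(String, Seq[String], Option[String])]) => {
  val entries = iter.toSeq
  val missing = entries.collect { case (key, _, None) => key }
  val fetched = lookupMissing(missing)            // single batched JDBC lookup
  entries.iterator.flatMap { case (key, _, prevState) =>
    prevState.orElse(fetched.get(key).map(_._1)).map(state => (key, state))
  }
}
{code}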

Would the new {{DStream.trackStateByKey()}} be a better approach?


The second more high-level question: is there a way to chain multiple state 
operations on the same state object?

Going with the above design approach the entire lookup logic would be 
handcrafted into some java/scala/python {{updateFunc}}. This function would go 
over all incoming keys, check which ones are missing from cache, batch the 
missing ones, run the JDBC queries and union the returned lookup data with the 
existing cache from the State object.

The fact that all of this must be handcrafted into a single function seems to 
be caused by the fact that Spark State processing logic on a high-level works 
like this:

{code:borderStyle=solid}
input: prevStateRdd, inputRDD
output: updateStateFunc( prevStateRdd, inputRdd )
{code}

So only a single updateStateFunc operating on prevStateRdd and inputRdd in one 
go. Once done there is no way to further refine the State as part of the 
current micro batch.

The multi-step processing required here sounds like a typical use-case for a 
DStream: apply multiple operations one after the other on some incoming data. 
So I wonder whether there is a way to extend the concept of state processing 
(maybe it already has been extended?) to do something like:

{code:borderStyle=solid}
*input: prevStateRdd, inputRdd*
missingKeys = inputRdd.filter(  )  
foundKeys   = inputRdd.filter(  )  
newLookupData   = lookupKeysUsingJdbcDataFrameRead( missingKeys.collect() )
newStateRdd = newLookupData.union( foundKeys).union( prevStateRdd )
*output: newStateRdd*
{code}

This would nicely leverage all the power and richness of Spark. The only 
missing bit - and the reason why this approach does not work today (based on my 
naive understanding of Spark) - is that {{newStateRdd}} cannot be declared to be 
the {{prevStateRdd}} of the next micro batch.

If Spark had a way of declaring an RDD (or DStream) to be the parent for the 
next batch run, even complex (chained) state operations would be easy to 
describe and would not require hand-written Java/Python/Scala updateFunctions.


Thanks a lot for taking the time to read all of this!!!

Any thoughts/pointers are much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-02 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036337#comment-15036337
 ] 

holdenk commented on SPARK-12040:
-

Working on this :)

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark, please refer the Scala one 
> SPARK-11766.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12094) Better format for query plan tree string

2015-12-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12094.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10099
[https://github.com/apache/spark/pull/10099]

> Better format for query plan tree string
> 
>
> Key: SPARK-12094
> URL: https://issues.apache.org/jira/browse/SPARK-12094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.6.0
>
>
> When examining plans of complex queries with multiple joins, a pain point of 
> mine is that it's hard to immediately see the sibling node of a specific 
> query plan node. For example:
> {noformat}
> TakeOrderedAndProject ...
>  ConvertToSafe
>   Project ...
>TungstenAggregate ...
> TungstenExchange ...
>  TungstenAggregate ...
>   Project ...
>BroadcastHashJoin ...
> Project ...
>  BroadcastHashJoin ...
>   Project ...
>BroadcastHashJoin ...
> Scan ...
> Filter ...
>  Scan ...
>   Scan ...
> Project ...
>  Filter ...
>   Scan ...
> {noformat}
> Would be better to have tree lines to indicate relationships between plan 
> nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12089:
---
Priority: Critical  (was: Major)

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12097) How to do a cached, batched JDBC-lookup in Spark Streaming?

2015-12-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036116#comment-15036116
 ] 

Sean Owen commented on SPARK-12097:
---

Normally I'd say we don't use JIRA for discussion (i.e. we don't use the 
Brainstorming type and can't delete it), but since the writeup here is good, 
and I have some back-story on the use case here which makes me believe it 
_could_ lead to a change, let's leave it for the moment.

Why not simply query once per batch for all the rows you need and cache them? 
This need not involve anything specific to Spark. {{updateStateByKey}} is more 
for when you want Spark to maintain some persistent data, but you are already 
maintaining it (in your RDBMS). It's unnecessarily complex here.

> How to do a cached, batched JDBC-lookup in Spark Streaming?
> ---
>
> Key: SPARK-12097
> URL: https://issues.apache.org/jira/browse/SPARK-12097
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Streaming
>Reporter: Christian Kurz
>
> h3. Use-case
> I need to enrich incoming Kafka data with data from a lookup table (or query) 
> on a relational database. Lookup data is changing slowly over time (so 
> caching is okay for a certain retention time). Lookup data is potentially 
> huge (so loading all data upfront is not an option).
> h3. Problem
> The overall design idea is to implement a cached and batched JDBC lookup. 
> That is, for any lookup keys that are missing from the lookup cache, a JDBC 
> lookup is done to retrieve the missing lookup data. JDBC lookups are rather 
> expensive (connection overhead, number of round-trips) and therefore must be 
> done in batches, e.g. one JDBC lookup per 100 missing keys.
> So the high-level logic might look something like this:
> # For every Kafka RDD we extract all lookup keys
> # For all lookup keys we check whether the lookup data is already available 
> in cache and whether this cached information has not expired yet.
> # For any lookup keys not found in cache (or expired), we send batched 
> prepared JDBC Statements to the database to fetch the missing lookup data:
> {{SELECT c1, c2, c3 FROM ... WHERE k1 in (?,?,?,...)}}
> to minimize the number of JDBC round-trips.
> # At this point we have up-to-date lookup data for all lookup keys and can 
> perform the actual lookup operation.
> Does this approach make sense on Spark? Would Spark State DStreams be the 
> right way to go? Or other design approaches?
> Assuming Spark State DStreams are the right direction, the low-level question 
> is how to do the batching?
> Would this particular signature of DStream.updateStateByKey ( iterator-> 
> iterator):
> {code:borderStyle=solid}
> def updateStateByKey[S: ClassTag](
>   updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
>   partitioner: Partitioner,
>   rememberPartitioner: Boolean,
>   initialRDD: RDD[(K, S)]
> )
> {code}
> be the right way to batch multiple incoming keys into a single JDBC-lookup 
> query?
> Would the new {{DStream.trackStateByKey()}} be a better approach?
> The second more high-level question: is there a way to chain multiple state 
> operations on the same state object?
> Going with the above design approach the entire lookup logic would be 
> handcrafted into some java/scala/python {{updateFunc}}. This function would 
> go over all incoming keys, check which ones are missing from cache, batch the 
> missing ones, run the JDBC queries and union the returned lookup data with 
> the existing cache from the State object.
> The fact that all of this must be handcrafted into a single function seems to 
> be caused by the fact that Spark State processing logic on a high-level works 
> like this:
> {code:borderStyle=solid}
> input: prevStateRdd, inputRDD
> output: updateStateFunc( prevStateRdd, inputRdd )
> {code}
> So only a single updateStateFunc operating on prevStateRdd and inputRdd in 
> one go. Once done there is no way to further refine the State as part of the 
> current micro batch.
> The multi-step processing required here sounds like a typical use-case for a 
> DStream: apply multiple operations one after the other on some incoming data. 
> So I wonder whether there is a way to extend the concept of state processing 
> (maybe it already has been extended?) to do something like:
> {code:borderStyle=solid}
> *input: prevStateRdd, inputRdd*
> missingKeys = inputRdd.filter(  )  
> foundKeys   = inputRdd.filter(  )  
> newLookupData   = lookupKeysUsingJdbcDataFrameRead( missingKeys.collect() )
> newStateRdd = newLookupData.union( foundKeys).union( prevStateRdd )
> *output: newStateRdd*
> {code}
> This would nicely leverage all the power and richness of Spark. The only 
> missing bit - and the reason why this approach does not work today 

[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036194#comment-15036194
 ] 

Davies Liu commented on SPARK-12089:


Is it possible that you have a record larger than 1G? I don't think so, or 
there are many places that would break.

Does this task always fail or not? If not, it means the length got corrupted 
somewhere; that's the root cause.

Can you reproduce this?

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12099) Standalone and Mesos Should use OnOutOfMemoryError handlers

2015-12-02 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-12099:


 Summary: Standalone and Mesos Should use OnOutOfMemoryError 
handlers
 Key: SPARK-12099
 URL: https://issues.apache.org/jira/browse/SPARK-12099
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos
Affects Versions: 1.6.0
Reporter: Imran Rashid


cc [~andrewor14] [~tnachen]

On SPARK-11801, we've been discussing the use of {{OnOutOfMemoryError}} to 
terminate the JVM in YARN mode.  There seems to be consensus that this is 
indeed the right thing to do.  I assume that the other cluster managers should 
also be doing the same thing.  Though maybe there is a good reason for not 
including it under standalone & Mesos mode (or perhaps this is already 
happening via some other mechanism I'm not seeing).  In any case, I thought it 
was worth drawing your attention to it; I didn't see this discussed in any 
previous issue.

(Note that there are currently some drawbacks to using {{OnOutOfMemoryError}}, 
in that you get some confusing msgs, but hopefully SPARK-11801 will address 
that.)
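For context, a hedged sketch of what such a handler looks like when set by hand through the executor JVM options (the exact value Spark's YARN backend injects may differ):

{code}
import org.apache.spark.SparkConf

// Ask executor JVMs to terminate on OOM instead of limping along.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:OnOutOfMemoryError='kill %p'")
{code}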



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036247#comment-15036247
 ] 

Erik Selin commented on SPARK-12089:


There shouldn't be a single record larger than 1G, no. But I'm doing a group by 
month, so I was guessing that a large amount of pre-aggregated data might end up 
in the same buffer?

I have tasks failing and later succeeding, and then there are others that seem 
to just kill the job. I'm not deeply familiar with this area of Spark, but I 
guess that does mean that we might be failing to reclaim parts of the buffer for 
later use? I can reproduce but it takes me ~ 4 hours to get to the failing 
stage :)

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> than 2 GB worth of data, and likely even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036295#comment-15036295
 ] 

Davies Liu commented on SPARK-12089:


[~tyro89] Are you building a large array using a group by? What does the query 
look like?

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the spark code it looks like this is due to a integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> then 2G worth of data. And likely will hold even less since any length > 
> 1073741824 will cause a integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12100) bug in spark/python/pyspark/rdd.py portable_hash()

2015-12-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12100:
---

 Summary: bug in spark/python/pyspark/rdd.py portable_hash()
 Key: SPARK-12100
 URL: https://issues.apache.org/jira/browse/SPARK-12100
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.1
Reporter: Andrew Davidson
Priority: Minor


I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I get an exception 'Randomness of hash of string 
should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not 
just set PYTHONHASHSEED?

Should I file a bug?

Kind regards

Andy

details

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract

The example from the documentation does not work out of the box:

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")


The following script fixes the problem:

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate 
Exception 'Randomness of hash of string should be disabled via 
PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh 
/root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

sudo for i in `cat slaves` ; do scp spark-env.sh 
root@$i:/root/spark/conf/spark-env.sh; done


This is how I am starting spark

export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  

$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*


see the email thread "possible bug spark/python/pyspark/rdd.py portable_hash()" 
on user@spark for more info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12040:


Assignee: Apache Spark

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark, please refer the Scala one 
> SPARK-11766.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036338#comment-15036338
 ] 

Apache Spark commented on SPARK-12040:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10106

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark, please refer the Scala one 
> SPARK-11766.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12040:


Assignee: (was: Apache Spark)

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark, please refer the Scala one 
> SPARK-11766.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036280#comment-15036280
 ] 

Davies Liu commented on SPARK-12089:


Could you turn on debug logging and paste the Java source code of the 
SpecificUnsafeProjection (the generated unsafe projection) here?

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the spark code it looks like this is due to a integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> then 2G worth of data. And likely will hold even less since any length > 
> 1073741824 will cause a integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12098) Cross validator with multi-arm bandit search

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12098:


Assignee: (was: Apache Spark)

> Cross validator with multi-arm bandit search
> 
>
> Key: SPARK-12098
> URL: https://issues.apache.org/jira/browse/SPARK-12098
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xusen Yin
>
> The classic cross-validation requires all inner classifiers to iterate for a 
> fixed number of iterations, or until they converge. It is costly, 
> especially in the massive-data scenario. According to the paper 
> Non-stochastic Best Arm Identification and Hyperparameter Optimization 
> (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce 
> the total number of iterations of cross-validation with multi-armed bandit 
> search.
> The multi-armed bandit search for cross-validation (bandit search for short) 
> requires warm-start of ML algorithms, and fine-grained control of the inner 
> behavior of the cross validator.
> Since there are a bunch of bandit search algorithms for finding the best 
> parameter set, we intend to provide only a few of them in the beginning to 
> reduce the test/perf-test work and make it more stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12098) Cross validator with multi-arm bandit search

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036231#comment-15036231
 ] 

Apache Spark commented on SPARK-12098:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10105

> Cross validator with multi-arm bandit search
> 
>
> Key: SPARK-12098
> URL: https://issues.apache.org/jira/browse/SPARK-12098
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xusen Yin
>
> The classic cross-validation requires all inner classifiers to iterate for a 
> fixed number of iterations, or until they converge. It is costly, 
> especially in the massive-data scenario. According to the paper 
> Non-stochastic Best Arm Identification and Hyperparameter Optimization 
> (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce 
> the total number of iterations of cross-validation with multi-armed bandit 
> search.
> The multi-armed bandit search for cross-validation (bandit search for short) 
> requires warm-start of ML algorithms, and fine-grained control of the inner 
> behavior of the cross validator.
> Since there are a bunch of bandit search algorithms for finding the best 
> parameter set, we intend to provide only a few of them in the beginning to 
> reduce the test/perf-test work and make it more stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12098) Cross validator with multi-arm bandit search

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12098:


Assignee: Apache Spark

> Cross validator with multi-arm bandit search
> 
>
> Key: SPARK-12098
> URL: https://issues.apache.org/jira/browse/SPARK-12098
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xusen Yin
>Assignee: Apache Spark
>
> The classic cross-validation requires all inner classifiers to iterate for a 
> fixed number of iterations, or until they converge. It is costly, 
> especially in the massive-data scenario. According to the paper 
> Non-stochastic Best Arm Identification and Hyperparameter Optimization 
> (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce 
> the total number of iterations of cross-validation with multi-armed bandit 
> search.
> The multi-armed bandit search for cross-validation (bandit search for short) 
> requires warm-start of ML algorithms, and fine-grained control of the inner 
> behavior of the cross validator.
> Since there are a bunch of bandit search algorithms for finding the best 
> parameter set, we intend to provide only a few of them in the beginning to 
> reduce the test/perf-test work and make it more stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11801) Notify driver when OOM is thrown before executor JVM is killed

2015-12-02 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036302#comment-15036302
 ] 

Imran Rashid commented on SPARK-11801:
--

to summarize, it seems we agree that:

1) we want to keep {{OnOutOfMemoryError}} in yarn mode.
2) we can't guarantee anything making it back to the driver at all on OOM

I *think* there is agreement that:

3) the current situation can lead to misleading msgs, which can be improved

and we are not totally agreed on:

4) should other cluster managers use {{OnOutOfMemoryError}}?  But this is 
independent; I've opened SPARK-12099 to deal with that, and I think we can 
ignore it here
5) *how* exactly should we improve the msgs?

So the main thing to discuss is #5.  I think there are a few options:

(a) given that we can't guarantee anything useful gets back to the driver, let's 
try to keep things consistent and always have the driver just receive a generic 
"executor lost" type of message. We could still try to improve the logs on the 
executor somewhat with the shutdown reprioritization

(b) make a best effort at getting a better error msg to the driver, and improve 
the logs on the executor.  I think this would include all of
(i) handle OOM specially in {{TaskRunner}}, sending a msg back to the driver 
immediately
(ii) kill the running tasks, in a handler that is higher priority than the disk 
cleanup
(iii) have {{YarnAllocator}} deal with {{SparkExitCode.OOM}}

There are probably more details to discuss on (b), I am not sure that proposal 
is 100% correct, but maybe to start, can we discuss (a) vs (b)?  
[~mrid...@yahoo-inc.com] you voiced a preference for (a) -- but do you feel 
very strongly about that?  I have a strong preference for (b), since I think we can 
do decently in most cases, and it would really help a lot of users out.
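
To make (b)(i) a bit more concrete, here is a rough, hypothetical sketch of what 
best-effort reporting could look like at the top of the task thread. This is not 
the actual {{TaskRunner}} code, and {{reportToDriver}} stands in for whatever 
RPC the executor already uses to send task failures back:

{code}
import scala.util.control.NonFatal

// Hypothetical sketch of (b)(i): best-effort OOM reporting before the JVM dies.
def runTaskWithOomReporting(taskId: Long, body: () => Unit,
                            reportToDriver: String => Unit): Unit = {
  try {
    body()
  } catch {
    case oom: OutOfMemoryError =>
      try {
        // Keep this path tiny: avoid allocation-heavy work, the heap is exhausted.
        reportToDriver(s"Task $taskId failed with OutOfMemoryError")
      } catch {
        case NonFatal(_) => // the send itself may fail; nothing more we can do
      }
      throw oom // preserve the existing OnOutOfMemoryError / exit behaviour
  }
}
{code}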

> Notify driver when OOM is thrown before executor JVM is killed 
> ---
>
> Key: SPARK-11801
> URL: https://issues.apache.org/jira/browse/SPARK-11801
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> Here is some background for the issue.
> A customer got an OOM exception in one of the tasks and the executor got 
> killed with kill %p. It is unclear from the driver logs/Spark UI why the task 
> or executor is lost. The customer has to look into the executor logs to see 
> that OOM is the cause of the lost task/executor. 
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures by making sure that the task updates the driver with the OOM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Erik Selin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036327#comment-15036327
 ] 

Erik Selin commented on SPARK-12089:


It's a bunch of table joins followed by a group by on multiple fields. One of 
the fields being a month window. Example:

{code}
select x.a, x.b, x.c, x.month from (
  select a, b, c, concat(year(from_unixtime(t)), "-", month(from_unixtime(t))) 
AS month
  from foo
  left join bar on foo.bar_id = bar.id
  left join biz on foo.biz_id = biz.id
) as x
group by x.a, x.b, x.c, x.month
{code}

My hypothesis (though I need your expertise to confirm or deny it, since I'm 
really not familiar with this area of Spark's codebase) is that the monthly 
group by is indeed creating something huge, perhaps by bucketing a lot of data 
together into the same buffer. I have similar jobs running very similar queries 
grouped by day, and they are not running into this issue.

I'll do a debug log run once I have some spare cycles on my end! :)
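
One cheap way to probe that hypothesis before another 4-hour run might be to 
look at how many rows land in each month bucket (a hedged sketch reusing the 
same {{foo}} / {{t}} names as the query above; compare against the equivalent 
day-level key):

{code}
// How large do the month buckets actually get?
sqlContext.sql(
  """SELECT concat(year(from_unixtime(t)), "-", month(from_unixtime(t))) AS month,
    |       count(*) AS rows_per_month
    |FROM foo
    |GROUP BY concat(year(from_unixtime(t)), "-", month(from_unixtime(t)))
    |ORDER BY count(*) DESC""".stripMargin).show()
{code}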

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the spark code it looks like this is due to a integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> then 2G worth of data. And likely will hold even less since any length > 
> 1073741824 will cause a integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-12-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036341#comment-15036341
 ] 

Thomas Graves commented on SPARK-1239:
--

I have another user hitting this as well.  The above mentions other issues that 
need to be addressed in MapOutputStatusTracker; do you have links to those 
other issues?

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2015-12-02 Thread Christoph Pirkl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036418#comment-15036418
 ] 

Christoph Pirkl commented on SPARK-10969:
-

This issue is about adding additional (optional) credentials for DynamoDB and 
CloudWatch. In my opinion it would be better to introduce a parameter object 
containing all credentials than to add four additional String parameters.

I don't think it is possible to use the same mechanism as in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L98
 (i.e. reading the credentials from System.getenv()) as this would mean 
defining environment variables on each node.

A generic configuration using a map would be a potential source of errors when 
you mistype a parameter name.
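
A hypothetical sketch of the parameter-object idea (the type and field names 
below do not exist in the API today; they are only meant to illustrate the 
shape of the proposal):

{code}
import com.amazonaws.auth.AWSCredentialsProvider

// Hypothetical parameter object -- illustrative names only.
case class KinesisStreamCredentials(
    kinesis: AWSCredentialsProvider,                   // required: read the stream
    dynamoDB: Option[AWSCredentialsProvider] = None,   // optional: KCL lease/checkpoint table
    cloudWatch: Option[AWSCredentialsProvider] = None) // optional: KCL metrics

// KinesisUtils.createStream(...) could then take one such object instead of
// several extra String parameters (again, a sketch, not the current signature).
{code}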

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib

2015-12-02 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-11219:
-
Description: 
There are several different formats for describing params in PySpark.MLlib, 
making it unclear what the preferred way to document is, i.e. vertical 
alignment vs single line.

This is to agree on a format and make it consistent across PySpark.MLlib.

Following the discussion in SPARK-10560, using 2 lines with an indentation is 
both readable and doesn't lead to changing many lines when adding/removing 
parameters.  If the parameter uses a default value, put this in parentheses on 
a new line under the description.

Example:
{noformat}
:param stepSize:
  Step size for each iteration of gradient descent.
  (default: 0.1)
:param numIterations:
  Number of iterations run for each batch of data.
  (default: 50)
{noformat}

h2. Current State of Parameter Description Formatting

h4. Classification
  * LogisticRegressionModel - single line descriptions, fix indentations
  * LogisticRegressionWithSGD - vertical alignment, sporadic default values
  * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values
  * SVMModel - single line
  * SVMWithSGD - vertical alignment, sporadic default values
  * NaiveBayesModel - single line
  * NaiveBayes - single line

h4. Clustering
  * KMeansModel - missing param description
  * KMeans - missing param description and defaults
  * GaussianMixture - vertical align, incorrect default formatting
  * PowerIterationClustering - single line with wrapped indentation, missing 
defaults
  * StreamingKMeansModel - single line wrapped
  * StreamingKMeans - single line wrapped, missing defaults
  * LDAModel - single line
  * LDA - vertical align, missing some defaults

h4. FPM  
  * FPGrowth - single line
  * PrefixSpan - single line, defaults values in backticks

h4. Recommendation
  * ALS - does not have param descriptions

h4. Regression
  * LabeledPoint - single line
  * LinearModel - single line
  * LinearRegressionWithSGD - vertical alignment
  * RidgeRegressionWithSGD - vertical align
  * IsotonicRegressionModel - single line
  * IsotonicRegression - single line, missing default

h4. Tree
  * DecisionTree - single line with vertical indentation, missing defaults
  * RandomForest - single line with wrapped indent, missing some defaults
  * GradientBoostedTrees - single line with wrapped indent

NOTE
This issue will just focus on model/algorithm descriptions, which are the 
largest source of inconsistent formatting.
evaluation.py, feature.py, random.py, utils.py - these supporting classes have 
param descriptions as single lines, but they are consistent, so they don't need 
to be changed.

  was:
There are several different formats for describing params in PySpark.MLlib, 
making it unclear what the preferred way to document is, i.e. vertical 
alignment vs single line.

This is to agree on a format and make it consistent across PySpark.MLlib.

Following the discussion in SPARK-10560, using 2 lines with an indentation is 
both readable and doesn't lead to changing many lines when adding/removing 
parameters.  If the parameter uses a default value, put this in parenthesis in 
a new line under the description.

Example:
{noformat}
:param stepSize:
  Step size for each iteration of gradient descent.
  (default: 0.1)
:param numIterations:
  Number of iterations run for each batch of data.
  (default: 50)
{noformat}


> Make Parameter Description Format Consistent in PySpark.MLlib
> -
>
> Key: SPARK-11219
> URL: https://issues.apache.org/jira/browse/SPARK-11219
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> There are several different formats for describing params in PySpark.MLlib, 
> making it unclear what the preferred way to document is, i.e. vertical 
> alignment vs single line.
> This is to agree on a format and make it consistent across PySpark.MLlib.
> Following the discussion in SPARK-10560, using 2 lines with an indentation is 
> both readable and doesn't lead to changing many lines when adding/removing 
> parameters.  If the parameter uses a default value, put this in parenthesis 
> in a new line under the description.
> Example:
> {noformat}
> :param stepSize:
>   Step size for each iteration of gradient descent.
>   (default: 0.1)
> :param numIterations:
>   Number of iterations run for each batch of data.
>   (default: 50)
> {noformat}
> h2. Current State of Parameter Description Formatting
> h4. Classification
>   * LogisticRegressionModel - single line descriptions, fix indentations
>   * LogisticRegressionWithSGD - vertical alignment, sporadic default values
>   * LogisticRegressionWithLBFGS - vertical alignment, sporadic default 

[jira] [Commented] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient

2015-12-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036487#comment-15036487
 ] 

Sean Owen commented on SPARK-12101:
---

To cut down on the noise: if this is logically the same issue as one you recently 
fixed, they should be attached to one JIRA. That is, someone who takes your 
first fix will want the second one; they aren't really separable.

> Fix thread pools that cannot cache tasks in Worker and AppClient
> 
>
> Key: SPARK-12101
> URL: https://issues.apache.org/jira/browse/SPARK-12101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. 
> It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10873) can't sort columns on history page

2015-12-02 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-10873:
--
Assignee: Zhuo Liu

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036375#comment-15036375
 ] 

Apache Spark commented on SPARK-11155:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/10107

> Stage summary json should include stage duration 
> -
>
> Key: SPARK-11155
> URL: https://issues.apache.org/jira/browse/SPARK-11155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Imran Rashid
>Assignee: Xin Ren
>Priority: Minor
>  Labels: Starter
>
> The json endpoint for stages doesn't include information on the stage 
> duration that is present in the UI.  This looks like a simple oversight, they 
> should be included.  eg., the metrics should be included at 
> {{api/v1/applications//stages}}. The missing metrics are 
> {{submissionTime}} and {{completionTime}} (and whatever other metrics come 
> out of the discussion on SPARK-10930)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Description: 
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to pool resources. I have 10+ topics and don't want to 
dedicate 10 cores to processing all of them. However, when reading the data 
produced by KafkaUtils.createStream, the DStream[(String,String)] does not 
properly insert the topic name into the tuple. The left key is always null, 
making it impossible to know which topic the data came from, other than 
stashing your key into the value.  Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams : IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map( i 
=>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc, consumerProperties,
topics,
StorageLevel.MEMORY_ONLY_SER((

val unioned :DStream[(String,String)] = ssc.union(streams)

unioned.flatMap(x => {

   val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE




  was:
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to pool resources. I have 10+ topics and don't want to 
dedicate 10 cores to processing all of them. However, when reading the data 
procuced by KafkaUtils.createStream, the DStream[(String,String)] does not 
properly insert the topic name into the tuple. The left-key always null, making 
it impossible to know what topic that data came from other than stashing your 
key into the value.  Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( i 
=>
  KafkaUtils.createStream[STring, STring, StringDecoder, StringDecoder](
ssc, consumerProperties,
topics,
StorageLevel.MEMORY_ONLY_SER((

val unioned :DStream[(String,String)] = ssc.union(streams)

unioned.flatMap(x => {

   val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE





> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Dan Dutrow
> Fix For: 1.0.1
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to pool resources. I have 10+ topics and don't want to 
> dedicate 10 cores to processing all of them. However, when reading the data 
> procuced by KafkaUtils.createStream, the DStream[(String,String)] does not 
> properly insert the topic name into the tuple. The left-key always null, 
> making it impossible to know what topic that data came from other than 
> stashing your key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER((
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong

2015-12-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036395#comment-15036395
 ] 

Thomas Graves commented on SPARK-11701:
---

Also seems related to https://github.com/apache/spark/pull/9288

> YARN - dynamic allocation and speculation active task accounting wrong
> --
>
> Key: SPARK-11701
> URL: https://issues.apache.org/jira/browse/SPARK-11701
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> I am using dynamic container allocation and speculation and am seeing issues 
> with the active task accounting.  The Executor UI still shows active tasks on 
> an executor even though the job/stage is all completed.  I think it's also 
> preventing dynamic allocation from releasing containers, because it thinks 
> there are still tasks.
> It's easily reproduced by using spark-shell: turn on dynamic allocation, then 
> run just a wordcount on a decent-sized file and set the speculation parameters 
> low: 
>  spark.dynamicAllocation.enabled true
>  spark.shuffle.service.enabled true
>  spark.dynamicAllocation.maxExecutors 10
>  spark.dynamicAllocation.minExecutors 2
>  spark.dynamicAllocation.initialExecutors 10
>  spark.dynamicAllocation.executorIdleTimeout 40s
> $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
> spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 
> --master yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11155) Stage summary json should include stage duration

2015-12-02 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-11155:

Attachment: Screen Shot 2015-12-02.png

Sorry for being so slow... this is my first code change, and getting familiar 
with Spark terminology and structure took me quite a while.

I've submitted the PR; please let me know if any changes are needed. Thank you 
very much [~imranr] for your detailed guidance on the code change, it really 
helped me a lot!  :)

> Stage summary json should include stage duration 
> -
>
> Key: SPARK-11155
> URL: https://issues.apache.org/jira/browse/SPARK-11155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Imran Rashid
>Assignee: Xin Ren
>Priority: Minor
>  Labels: Starter
> Attachments: Screen Shot 2015-12-02.png
>
>
> The json endpoint for stages doesn't include information on the stage 
> duration that is present in the UI.  This looks like a simple oversight, they 
> should be included.  eg., the metrics should be included at 
> {{api/v1/applications//stages}}. The missing metrics are 
> {{submissionTime}} and {{completionTime}} (and whatever other metrics come 
> out of the discussion on SPARK-10930)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12101:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix thread pools that cannot cache tasks in Worker and AppClient
> 
>
> Key: SPARK-12101
> URL: https://issues.apache.org/jira/browse/SPARK-12101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. 
> It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Fix Version/s: (was: 1.0.1)
   1.4.2

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1
>Reporter: Dan Dutrow
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to pool resources. I have 10+ topics and don't want to 
> dedicate 10 cores to processing all of them. However, when reading the data 
> procuced by KafkaUtils.createStream, the DStream[(String,String)] does not 
> properly insert the topic name into the tuple. The left-key always null, 
> making it impossible to know what topic that data came from other than 
> stashing your key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Affects Version/s: 1.1.0
   1.2.0
   1.3.0
   1.4.0
   1.4.1

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1
>Reporter: Dan Dutrow
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to pool resources. I have 10+ topics and don't want to 
> dedicate 10 cores to processing all of them. However, when reading the data 
> procuced by KafkaUtils.createStream, the DStream[(String,String)] does not 
> properly insert the topic name into the tuple. The left-key always null, 
> making it impossible to know what topic that data came from other than 
> stashing your key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient

2015-12-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12101:


 Summary: Fix thread pools that cannot cache tasks in Worker and 
AppClient
 Key: SPARK-12101
 URL: https://issues.apache.org/jira/browse/SPARK-12101
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2, 1.4.1, 1.6.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. 
It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool.
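
For context, a small standalone illustration of the failure mode (plain 
java.util.concurrent, no Spark classes): a bounded pool backed by a 
SynchronousQueue rejects new work once all threads are busy, because that queue 
cannot hold ("cache") pending tasks, while the same pool backed by a 
LinkedBlockingQueue simply queues them.

{code}
import java.util.concurrent._

object SynchronousQueueDemo {
  private val busy = new Runnable { def run(): Unit = Thread.sleep(1000) }

  def main(args: Array[String]): Unit = {
    // One thread + SynchronousQueue: the second submit is rejected outright.
    val rejecting = new ThreadPoolExecutor(1, 1, 60L, TimeUnit.SECONDS,
      new SynchronousQueue[Runnable]())
    rejecting.execute(busy)
    try rejecting.execute(busy)
    catch { case e: RejectedExecutionException => println("rejected: " + e) }

    // Same bounds + LinkedBlockingQueue: the second task waits in the queue.
    val queueing = new ThreadPoolExecutor(1, 1, 60L, TimeUnit.SECONDS,
      new LinkedBlockingQueue[Runnable]())
    queueing.execute(busy)
    queueing.execute(busy) // queued, runs after the first finishes

    rejecting.shutdown(); queueing.shutdown()
  }
}
{code}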



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12102) In check analysis, data type check always

2015-12-02 Thread Yin Huai (JIRA)
Yin Huai created SPARK-12102:


 Summary: In check analysis, data type check always 
 Key: SPARK-12102
 URL: https://issues.apache.org/jira/browse/SPARK-12102
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will see 
{{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 0) 
THEN 
struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
 as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
expressions should all be same type or coercible to a common type; line 1 pos 
85}}.

The problem is the nullability difference between {{4}} (non-nullable) and 
{{hash(4)}} (nullable).

It seems to make sense to cast the nullability during analysis. 
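
A quick way to see the mismatch from a spark-shell (a hedged sketch: it assumes 
a HiveContext so that {{hash}} resolves to the Hive UDF, and the schema output 
shown in the comments is approximate):

{code}
// Both branches build a 4-field struct, but the last field's nullability differs,
// so the CASE WHEN type check treats them as two different struct types.
sqlContext.sql("select struct(1, 2, 3, 4) as s").printSchema()
// ... |-- col4: integer (nullable = false)

sqlContext.sql("select struct(1, 2, 3, cast(hash(4) as int)) as s").printSchema()
// ... |-- col4: integer (nullable = true)   <-- cast(hash(4) as int) is nullable
{code}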



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis

2015-12-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12102:
-
Summary: Cast a non-nullable struct field to a nullable field during 
analysis  (was: In check analysis, data type check always )

> Cast a non-nullable struct field to a nullable field during analysis
> 
>
> Key: SPARK-12102
> URL: https://issues.apache.org/jira/browse/SPARK-12102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, 
> cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will 
> see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 
> 0) THEN 
> struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4)
>  as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE 
> expressions should all be same type or coercible to a common type; line 1 pos 
> 85}}.
> The problem is the nullability difference between {{4}} (non-nullable) and 
> {{hash(4)}} (nullable).
> Seems it makes sense to cast the nullability in the analysis. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12000:
-
Target Version/s: 1.7.0  (was: 1.6.0)

> `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
> -
>
> Key: SPARK-12000
> URL: https://issues.apache.org/jira/browse/SPARK-12000
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Josh Rosen
>Priority: Blocker
>
> Reported by [~josephkb]. Not sure what is the root cause, but this is the 
> error message when I ran "sbt publishLocal":
> {code}
> [error] (launcher/compile:doc) javadoc returned nonzero exit code
> [error] (mllib/compile:doc) scala.reflect.internal.FatalError:
> [error]  while compiling: 
> /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala
> [error] during phase: global=terminal, atPhase=parser
> [error]  library version: version 2.10.5
> [error] compiler version: version 2.10.5
> [error]   reconstructed args: -Yno-self-type-checks -groups -classpath 
> 

[jira] [Commented] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong

2015-12-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036366#comment-15036366
 ] 

Thomas Graves commented on SPARK-11701:
---

this looks like a dup of SPARK-9038

> YARN - dynamic allocation and speculation active task accounting wrong
> --
>
> Key: SPARK-11701
> URL: https://issues.apache.org/jira/browse/SPARK-11701
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> I am using dynamic container allocation and speculation and am seeing issues 
> with the active task accounting.  The Executor UI still shows active tasks on 
> an executor even though the job/stage is all completed.  I think it's also 
> preventing dynamic allocation from releasing containers, because it thinks 
> there are still tasks.
> It's easily reproduced by using spark-shell: turn on dynamic allocation, then 
> run just a wordcount on a decent-sized file and set the speculation parameters 
> low: 
>  spark.dynamicAllocation.enabled true
>  spark.shuffle.service.enabled true
>  spark.dynamicAllocation.maxExecutors 10
>  spark.dynamicAllocation.minExecutors 2
>  spark.dynamicAllocation.initialExecutors 10
>  spark.dynamicAllocation.executorIdleTimeout 40s
> $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
> spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 
> --master yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient

2015-12-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12101:
--
Priority: Minor  (was: Major)

> Fix thread pools that cannot cache tasks in Worker and AppClient
> 
>
> Key: SPARK-12101
> URL: https://issues.apache.org/jira/browse/SPARK-12101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. 
> It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12101:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix thread pools that cannot cache tasks in Worker and AppClient
> 
>
> Key: SPARK-12101
> URL: https://issues.apache.org/jira/browse/SPARK-12101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. 
> It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036516#comment-15036516
 ] 

Apache Spark commented on SPARK-12101:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10108

> Fix thread pools that cannot cache tasks in Worker and AppClient
> 
>
> Key: SPARK-12101
> URL: https://issues.apache.org/jira/browse/SPARK-12101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. 
> It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Description: 
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to pool resources. I have 10+ topics and don't want to 
dedicate 10 cores to processing all of them. However, when reading the data 
produced by KafkaUtils.createStream, the DStream[(String,String)] does not 
properly insert the topic name into the tuple. The left key is always null, making 
it impossible to know what topic the data came from other than stashing the 
topic into the value yourself.  Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( i 
=>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, consumerProperties,
    topics,
    StorageLevel.MEMORY_ONLY_SER))

val unioned :DStream[(String,String)] = ssc.union(streams)

unioned.flatMap(x => {

   val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE




  was:
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to pool resources. I have 10+ topics and don't want to 
dedicate 10 cores to processing all of them. However, when reading the data 
procuced by KafkaUtils.createStream, the DStream[(String,String)] does not 
properly insert the topic name into the tuple. The left-key always null, making 
it impossible to know what topic that data came from other than stashing your 
key into the value.  Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( i 
=>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc, consumerProperties,
topics,
StorageLevel.MEMORY_ONLY_SER((

val unioned :DStream[(String,String)] = ssc.union(streams)

unioned.flatMap(x => {

   val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE





> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Dan Dutrow
> Fix For: 1.0.1
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to pool resources. I have 10+ topics and don't want to 
> dedicate 10 cores to processing all of them. However, when reading the data 
> produced by KafkaUtils.createStream, the DStream[(String,String)] does not 
> properly insert the topic name into the tuple. The left key is always null, 
> making it impossible to know what topic the data came from other than 
> stashing the topic into the value yourself.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE
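
As a side note on the Direct API mentioned above: it exposes each record's metadata through a messageHandler, so the topic name can be kept in the tuple. A heavily simplified sketch follows (assuming the spark-streaming-kafka direct API of these versions; the broker address, starting offsets and StreamingContext setup are placeholders, not values from this report):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("topic-aware").setMaster("local[2]")  // placeholder master
val ssc = new StreamingContext(conf, Seconds(2))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")   // placeholder broker
val fromOffsets = Map(TopicAndPartition("topicA", 0) -> 0L)          // placeholder offsets

// The messageHandler sees MessageAndMetadata, so the topic survives into the stream.
val withTopic = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
  (String, String)](ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))
{code}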



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib

2015-12-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036470#comment-15036470
 ] 

Bryan Cutler commented on SPARK-11219:
--

I added an assessment of the current state of param descriptions for 
algorithms/models in pyspark.mllib.  To keep changes well separated, I will 
make sub-tasks for each Python file, except for FPM and Recommendation which 
are small and can probably be combined.

> Make Parameter Description Format Consistent in PySpark.MLlib
> -
>
> Key: SPARK-11219
> URL: https://issues.apache.org/jira/browse/SPARK-11219
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> There are several different formats for describing params in PySpark.MLlib, 
> making it unclear what the preferred way to document is, i.e. vertical 
> alignment vs single line.
> This is to agree on a format and make it consistent across PySpark.MLlib.
> Following the discussion in SPARK-10560, using 2 lines with an indentation is 
> both readable and doesn't lead to changing many lines when adding/removing 
> parameters.  If the parameter uses a default value, put it in parentheses 
> on a new line under the description.
> Example:
> {noformat}
> :param stepSize:
>   Step size for each iteration of gradient descent.
>   (default: 0.1)
> :param numIterations:
>   Number of iterations run for each batch of data.
>   (default: 50)
> {noformat}
> h2. Current State of Parameter Description Formatting
> h4. Classification
>   * LogisticRegressionModel - single line descriptions, fix indentations
>   * LogisticRegressionWithSGD - vertical alignment, sporadic default values
>   * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values
>   * SVMModel - single line
>   * SVMWithSGD - vertical alignment, sporadic default values
>   * NaiveBayesModel - single line
>   * NaiveBayes - single line
> h4. Clustering
>   * KMeansModel - missing param description
>   * KMeans - missing param description and defaults
>   * GaussianMixture - vertical align, incorrect default formatting
>   * PowerIterationClustering - single line with wrapped indentation, missing 
> defaults
>   * StreamingKMeansModel - single line wrapped
>   * StreamingKMeans - single line wrapped, missing defaults
>   * LDAModel - single line
>   * LDA - vertical align, missing some defaults
> h4. FPM  
>   * FPGrowth - single line
>   * PrefixSpan - single line, defaults values in backticks
> h4. Recommendation
>   * ALS - does not have param descriptions
> h4. Regression
>   * LabeledPoint - single line
>   * LinearModel - single line
>   * LinearRegressionWithSGD - vertical alignment
>   * RidgeRegressionWithSGD - vertical align
>   * IsotonicRegressionModel - single line
>   * IsotonicRegression - single line, missing default
> h4. Tree
>   * DecisionTree - single line with vertical indentation, missing defaults
>   * RandomForest - single line with wrapped indent, missing some defaults
>   * GradientBoostedTrees - single line with wrapped indent
> NOTE
> This issue will just focus on model/algorithm descriptions, which are the 
> largest source of inconsistent formatting
> evaluation.py, feature.py, random.py, utils.py - these supporting classes 
> have param descriptions as single line, but are consistent so don't need to 
> be changed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)
Dan Dutrow created SPARK-12103:
--

 Summary: KafkaUtils createStream with multiple topics -- does not 
work as expected
 Key: SPARK-12103
 URL: https://issues.apache.org/jira/browse/SPARK-12103
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Dan Dutrow
 Fix For: 1.0.1


The default way of creating a stream out of a Kafka source would be:

val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", 
  Map("retarget" -> 2, "datapair" -> 2))

However, if two topics - in this case "retarget" and "datapair" - are very 
different, there is no way to set up different filters, mapping functions, etc., 
as they are effectively merged.

However, the KafkaInputDStream instance created by this call internally calls 
ConsumerConnector.createMessageStream(), which returns a *map* of KafkaStreams 
keyed by topic. It would be great if this map were exposed somehow, so that a 
call such as 

val streamS = KafkaUtils.createStreamS(...)

returned a map of streams.

Regards,
Sergey Malov
Collective Media



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Description: 
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to pool resources. I have 10+ topics and don't want to 
dedicate 10 cores to processing all of them. However, when reading the data 
produced by KafkaUtils.createStream, the DStream[(String,String)] does not 
properly insert the topic name into the tuple. The left key is always null, making 
it impossible to know what topic the data came from other than stashing the 
topic into the value yourself.  Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( i 
=>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, consumerProperties,
    topics,
    StorageLevel.MEMORY_ONLY_SER))

val unioned :DStream[(String,String)] = ssc.union(streams)

unioned.flatMap(x => {

   val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE




  was:
Default way of creating stream out of Kafka source would be as

val stream = KafkaUtils.createStream(ssc,"localhost:2181","logs", 
Map("retarget" -> 2,"datapair" -> 2))

However, if two topics - in this case "retarget" and "datapair" - are very 
different, there is no way to set up different filter, mapping functions, etc), 
as they are effectively merged.

However, instance of KafkaInputDStream, created with this call internally calls 
ConsumerConnector.createMessageStream() which returns *map* of KafkaStreams, 
keyed by topic. It would be great if this map would be exposed somehow, so 
aforementioned call 

val streamS = KafkaUtils.createStreamS(...)

returned map of streams.

Regards,
Sergey Malov
Collective Media


> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Dan Dutrow
> Fix For: 1.0.1
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to pool resources. I have 10+ topics and don't want to 
> dedicate 10 cores to processing all of them. However, when reading the data 
> produced by KafkaUtils.createStream, the DStream[(String,String)] does not 
> properly insert the topic name into the tuple. The left key is always null, 
> making it impossible to know what topic the data came from other than 
> stashing the topic into the value yourself.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Target Version/s: 1.4.2, 1.6.1  (was: 1.0.1)

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1
>Reporter: Dan Dutrow
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to pool resources. I have 10+ topics and don't want to 
> dedicate 10 cores to processing all of them. However, when reading the data 
> produced by KafkaUtils.createStream, the DStream[(String,String)] does not 
> properly insert the topic name into the tuple. The left key is always null, 
> making it impossible to know what topic the data came from other than 
> stashing the topic into the value yourself.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035455#comment-15035455
 ] 

Apache Spark commented on SPARK-12088:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/10095

> check connection.isClose before connection.getAutoCommit in JDBCRDD.close
> -
>
> Key: SPARK-12088
> URL: https://issues.apache.org/jira/browse/SPARK-12088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> in JDBCRDD, it has
> if (!conn.getAutoCommit && !conn.isClosed) {
> try {
>   conn.commit()
> } 
> . . . . . .
> In my test, the connection is already closed, so conn.getAutoCommit throws an 
> exception. We should check !conn.isClosed before checking !conn.getAutoCommit 
> to avoid the exception. 
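
In other words, the proposed check amounts to reordering the two conditions. A minimal sketch of that shape (a standalone JDBC helper for illustration, not the actual JDBCRDD patch):

{code}
import java.sql.{Connection, SQLException}

// Test isClosed first, so getAutoCommit is never invoked on a closed connection.
def commitIfNeeded(conn: Connection): Unit = {
  if (!conn.isClosed && !conn.getAutoCommit) {
    try {
      conn.commit()
    } catch {
      case e: SQLException =>
        println(s"Exception committing transaction: $e")  // Spark would log a warning here
    }
  }
}
{code}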



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12093) Fix the error of comment in DDLParser

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12093:


Assignee: Apache Spark

> Fix the error of comment in DDLParser
> -
>
> Key: SPARK-12093
> URL: https://issues.apache.org/jira/browse/SPARK-12093
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yadong Qi
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12093) Fix the error of comment in DDLParser

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035462#comment-15035462
 ] 

Apache Spark commented on SPARK-12093:
--

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10096

> Fix the error of comment in DDLParser
> -
>
> Key: SPARK-12093
> URL: https://issues.apache.org/jira/browse/SPARK-12093
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yadong Qi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12093) Fix the error of comment in DDLParser

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12093:


Assignee: (was: Apache Spark)

> Fix the error of comment in DDLParser
> -
>
> Key: SPARK-12093
> URL: https://issues.apache.org/jira/browse/SPARK-12093
> Project: Spark
>  Issue Type: Documentation
>Reporter: Yadong Qi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11964) Create user guide section explaining export/import

2015-12-02 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035480#comment-15035480
 ] 

Bill Chambers commented on SPARK-11964:
---

quick question, am I to assume that all pieces mentioned in this jira: 
https://issues.apache.org/jira/browse/SPARK-6725 are to be included in the new 
release [and the user guide]?

> Create user guide section explaining export/import
> --
>
> Key: SPARK-11964
> URL: https://issues.apache.org/jira/browse/SPARK-11964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> I'm envisioning a single section in the main guide explaining how it works 
> with an example and noting major missing coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12048) JDBCRDD calls close() twice - SQLite then throws an exception

2015-12-02 Thread R. H. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035486#comment-15035486
 ] 

R. H. commented on SPARK-12048:
---

I have added the missing line, built a distribution, and tested it 
successfully: SQLite no longer throws the exception (and a test against 
PostgreSQL was OK as well).

The PR is in https://github.com/rh99/spark

- What else can I do with the PR?

> JDBCRDD calls close() twice - SQLite then throws an exception
> -
>
> Key: SPARK-12048
> URL: https://issues.apache.org/jira/browse/SPARK-12048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: R. H.
>Priority: Minor
>  Labels: starter
>
> The following code works:
> {quote}
>   val tableData = sqlContext.read.format("jdbc")
> .options(
>   Map(
> "url" -> "jdbc:sqlite:/tmp/test.db",
> "dbtable" -> "testtable")).load()
> {quote}
> but an exception gets reported. From the log:
> {quote}
> 15/11/30 12:13:02 INFO jdbc.JDBCRDD: closed connection
> 15/11/30 12:13:02 WARN jdbc.JDBCRDD: Exception closing statement 
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database 
> (Connection is closed) at org.sqlite.core.DB.newSQLException(DB.java:890) at 
> org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109) at 
> org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35) at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454)
> {quote}
> So Spark succeeded in closing the JDBC connection, and then it fails to close 
> the JDBC statement.
> Looking at the source, close() seems to be called twice: 
> {quote}
> context.addTaskCompletionListener\{ context => close() \}
> {quote}
> and in
> {quote}
> def hasNext
> {quote}
> If you look at the close() method (around line 443)
> {quote}
>   def close() {
> if (closed) return
> {quote}
> you can see that it checks the variable closed, but that value is never set 
> to true.
> So a trivial fix should be to set "closed = true" at the end of close().
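
For illustration, the one-line fix described above would look roughly like this (class and field names are assumed for the sketch, not copied from JDBCRDD):

{code}
class ClosingIterator {
  private var closed = false

  def close(): Unit = {
    if (closed) return
    // ... release the statement, result set and connection here ...
    closed = true  // the missing assignment: makes a second close() call a no-op
  }
}
{code}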



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035496#comment-15035496
 ] 

Apache Spark commented on SPARK-12067:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10097

> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> * SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced 
> by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
> Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
> Column.isnotnull.
> * Add Column.notnull as alias of Column.isnotnull following the pandas naming 
> convention.
> * Add DataFrame.isnotnull and DataFrame.notnull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column

2015-12-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12067:

Summary: Fix usage of isnan, isnull, isnotnull of Column  (was: Fix usage 
of isnan, isnull, isnotnull of Column and DataFrame)

> Fix usage of isnan, isnull, isnotnull of Column
> ---
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced 
> by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
> Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
> Column.isnotnull.-
> * -Add Column.notnull as alias of Column.isnotnull following the pandas 
> naming convention.-
> * -Add DataFrame.isnotnull and DataFrame.notnull.-
> * Add Column.isNaN to PySpark.
> * Add isnull, notnull, isnan as alias at Python side in order to be 
> compatible with pandas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-12-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12067:

Description: 
* -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by 
DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
Column.isnotnull.-
* -Add Column.notnull as alias of Column.isnotnull following the pandas naming 
convention.-
* -Add DataFrame.isnotnull and DataFrame.notnull.-

* Add Column.isNaN to PySpark.
* Add isnull, notnull, isnan as alias at Python side in order to be compatible 
with pandas.

  was:
* -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by 
DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
Column.isnotnull.-
* -Add Column.notnull as alias of Column.isnotnull following the pandas naming 
convention.-
* -Add DataFrame.isnotnull and DataFrame.notnull.-

Add Column.isNaN to PySpark.
Add isnull, notnull, isnan as alias at Python side in order to be compatible 
with pandas.


> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced 
> by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
> Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
> Column.isnotnull.-
> * -Add Column.notnull as alias of Column.isnotnull following the pandas 
> naming convention.-
> * -Add DataFrame.isnotnull and DataFrame.notnull.-
> * Add Column.isNaN to PySpark.
> * Add isnull, notnull, isnan as alias at Python side in order to be 
> compatible with pandas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12088:


Assignee: Apache Spark

> check connection.isClose before connection.getAutoCommit in JDBCRDD.close
> -
>
> Key: SPARK-12088
> URL: https://issues.apache.org/jira/browse/SPARK-12088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> in JDBCRDD, it has
> if (!conn.getAutoCommit && !conn.isClosed) {
> try {
>   conn.commit()
> } 
> . . . . . .
> In my test, the connection is already closed, so conn.getAutoCommit throws an 
> exception. We should check !conn.isClosed before checking !conn.getAutoCommit 
> to avoid the exception. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12088:


Assignee: (was: Apache Spark)

> check connection.isClose before connection.getAutoCommit in JDBCRDD.close
> -
>
> Key: SPARK-12088
> URL: https://issues.apache.org/jira/browse/SPARK-12088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> in JDBCRDD, it has
> if (!conn.getAutoCommit && !conn.isClosed) {
> try {
>   conn.commit()
> } 
> . . . . . .
> In my test, the connection is already closed, so conn.getAutoCommit throws an 
> exception. We should check !conn.isClosed before checking !conn.getAutoCommit 
> to avoid the exception. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12093) Fix the error of comment in DDLParser

2015-12-02 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-12093:
-

 Summary: Fix the error of comment in DDLParser
 Key: SPARK-12093
 URL: https://issues.apache.org/jira/browse/SPARK-12093
 Project: Spark
  Issue Type: Documentation
Reporter: Yadong Qi






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11964) Create user guide section explaining export/import

2015-12-02 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035480#comment-15035480
 ] 

Bill Chambers edited comment on SPARK-11964 at 12/2/15 8:52 AM:


quick question, am I to assume that all pieces mentioned in this jira: 
https://issues.apache.org/jira/browse/SPARK-6725 are to be included, even those 
that are unresolved, in the new release [and the user guide]?


was (Author: bill_chambers):
quick question, am I to assume that all pieces mentioned in this jira: 
https://issues.apache.org/jira/browse/SPARK-6725 are to be included in the new 
release [and the user guide]?

> Create user guide section explaining export/import
> --
>
> Key: SPARK-11964
> URL: https://issues.apache.org/jira/browse/SPARK-11964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> I'm envisioning a single section in the main guide explaining how it works 
> with an example and noting major missing coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12094) Better format for query plan tree string

2015-12-02 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12094:
--

 Summary: Better format for query plan tree string
 Key: SPARK-12094
 URL: https://issues.apache.org/jira/browse/SPARK-12094
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.7.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


When examining query plans of complex queries with multiple joins, a pain 
point of mine is that it's hard to immediately see the sibling node of a 
specific query plan node. For example:

{noformat}
TakeOrderedAndProject ...
 ConvertToSafe
  Project ...
   TungstenAggregate ...
TungstenExchange ...
 TungstenAggregate ...
  Project ...
   BroadcastHashJoin ...
Project ...
 BroadcastHashJoin ...
  Project ...
   BroadcastHashJoin ...
Scan ...
Filter ...
 Scan ...
  Scan ...
Project ...
 Filter ...
  Scan ...
{noformat}

Would be better to have tree lines to indicate relationships between plan nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12048) JDBCRDD calls close() twice - SQLite then throws an exception

2015-12-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035521#comment-15035521
 ] 

Sean Owen commented on SPARK-12048:
---

You need to read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark [~rh99]

> JDBCRDD calls close() twice - SQLite then throws an exception
> -
>
> Key: SPARK-12048
> URL: https://issues.apache.org/jira/browse/SPARK-12048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: R. H.
>Priority: Minor
>  Labels: starter
>
> The following code works:
> {quote}
>   val tableData = sqlContext.read.format("jdbc")
> .options(
>   Map(
> "url" -> "jdbc:sqlite:/tmp/test.db",
> "dbtable" -> "testtable")).load()
> {quote}
> but an exception gets reported. From the log:
> {quote}
> 15/11/30 12:13:02 INFO jdbc.JDBCRDD: closed connection
> 15/11/30 12:13:02 WARN jdbc.JDBCRDD: Exception closing statement 
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database 
> (Connection is closed) at org.sqlite.core.DB.newSQLException(DB.java:890) at 
> org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109) at 
> org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35) at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454)
> {quote}
> So Spark succeeded in closing the JDBC connection, and then it fails to close 
> the JDBC statement.
> Looking at the source, close() seems to be called twice: 
> {quote}
> context.addTaskCompletionListener\{ context => close() \}
> {quote}
> and in
> {quote}
> def hasNext
> {quote}
> If you look at the close() method (around line 443)
> {quote}
>   def close() {
> if (closed) return
> {quote}
> you can see that it checks the variable closed, but that value is never set 
> to true.
> So a trivial fix should be to set "closed = true" at the end of close().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-12-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12067:

Description: 
-* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by 
DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
Column.isnotnull.
* Add Column.notnull as alias of Column.isnotnull following the pandas naming 
convention.
* Add DataFrame.isnotnull and DataFrame.notnull.-

Add Column.isNaN to PySpark.
Add isnull, notnull, isnan as alias at Python side in order to be compatible 
with pandas.

  was:
* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by 
DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
Column.isnotnull.
* Add Column.notnull as alias of Column.isnotnull following the pandas naming 
convention.
* Add DataFrame.isnotnull and DataFrame.notnull.


> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> -* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced 
> by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
> Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
> Column.isnotnull.
> * Add Column.notnull as alias of Column.isnotnull following the pandas naming 
> convention.
> * Add DataFrame.isnotnull and DataFrame.notnull.-
> Add Column.isNaN to PySpark.
> Add isnull, notnull, isnan as alias at Python side in order to be compatible 
> with pandas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-12-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12067:

Description: 
* -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by 
DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
Column.isnotnull.-
* -Add Column.notnull as alias of Column.isnotnull following the pandas naming 
convention.-
* -Add DataFrame.isnotnull and DataFrame.notnull.-

Add Column.isNaN to PySpark.
Add isnull, notnull, isnan as alias at Python side in order to be compatible 
with pandas.

  was:
-* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by 
DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
Column.isnotnull.
* Add Column.notnull as alias of Column.isnotnull following the pandas naming 
convention.
* Add DataFrame.isnotnull and DataFrame.notnull.-

Add Column.isNaN to PySpark.
Add isnull, notnull, isnan as alias at Python side in order to be compatible 
with pandas.


> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced 
> by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to 
> Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to 
> Column.isnotnull.-
> * -Add Column.notnull as alias of Column.isnotnull following the pandas 
> naming convention.-
> * -Add DataFrame.isnotnull and DataFrame.notnull.-
> Add Column.isNaN to PySpark.
> Add isnull, notnull, isnan as alias at Python side in order to be compatible 
> with pandas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12093) Fix the error of comment in DDLParser

2015-12-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12093:
--
   Priority: Trivial  (was: Major)
Component/s: Documentation

[~waterman] This probably isn't worth a JIRA. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and 
fill out JIRAs more carefully in the future. This has no component, is not 
"Major". Most importantly, this JIRA provides no information about the problem.

> Fix the error of comment in DDLParser
> -
>
> Key: SPARK-12093
> URL: https://issues.apache.org/jira/browse/SPARK-12093
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Yadong Qi
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12094) Better format for query plan tree string

2015-12-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12094:
---
Description: 
When examining plans of complex queries with multiple joins, a pain point of 
mine is that it's hard to immediately see the sibling node of a specific query 
plan node. For example:

{noformat}
TakeOrderedAndProject ...
 ConvertToSafe
  Project ...
   TungstenAggregate ...
TungstenExchange ...
 TungstenAggregate ...
  Project ...
   BroadcastHashJoin ...
Project ...
 BroadcastHashJoin ...
  Project ...
   BroadcastHashJoin ...
Scan ...
Filter ...
 Scan ...
  Scan ...
Project ...
 Filter ...
  Scan ...
{noformat}

Would be better to have tree lines to indicate relationships between plan nodes.

  was:
When examine query plans of complex query plans with multiple joins, a pain 
point of mine is that, it's hard to immediately see the sibling node of a 
specific query plan node. For example:

{noformat}
TakeOrderedAndProject ...
 ConvertToSafe
  Project ...
   TungstenAggregate ...
TungstenExchange ...
 TungstenAggregate ...
  Project ...
   BroadcastHashJoin ...
Project ...
 BroadcastHashJoin ...
  Project ...
   BroadcastHashJoin ...
Scan ...
Filter ...
 Scan ...
  Scan ...
Project ...
 Filter ...
  Scan ...
{noformat}

Would be better to have tree lines to indicate relationships between plan nodes.


> Better format for query plan tree string
> 
>
> Key: SPARK-12094
> URL: https://issues.apache.org/jira/browse/SPARK-12094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> When examining plans of complex queries with multiple joins, a pain point of 
> mine is that it's hard to immediately see the sibling node of a specific 
> query plan node. For example:
> {noformat}
> TakeOrderedAndProject ...
>  ConvertToSafe
>   Project ...
>TungstenAggregate ...
> TungstenExchange ...
>  TungstenAggregate ...
>   Project ...
>BroadcastHashJoin ...
> Project ...
>  BroadcastHashJoin ...
>   Project ...
>BroadcastHashJoin ...
> Scan ...
> Filter ...
>  Scan ...
>   Scan ...
> Project ...
>  Filter ...
>   Scan ...
> {noformat}
> Would be better to have tree lines to indicate relationships between plan 
> nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12080) Kryo - Support multiple user registrators

2015-12-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12080:
--
Affects Version/s: (was: 1.6.1)
   1.5.2

> Kryo - Support multiple user registrators
> -
>
> Key: SPARK-12080
> URL: https://issues.apache.org/jira/browse/SPARK-12080
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Rotem
>Priority: Minor
>  Labels: kryo, registrator, serializers
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Background: Currently when users need to have a custom serializer for their 
> registered classes, they use the user registrator of Kryo using the 
> spark.kryo.registrator configuration parameter.
> Problem: If the Spark user is an infrastructure itself, it may receive 
> multiple such registrators but won't be able to register them.
> Important note: Currently the single supported registrator can't reach any 
> state/configuration (it is instantiated by reflection with an empty constructor).
> Using SparkEnv from user code isn't acceptable.
> Workaround:
> Create a wrapper registrator as a user, and have its implementation scan the 
> class path for multiple classes. 
> Caveat: this is inefficient and too complicated.
> Suggested solution - support multiple registrators + stay backward compatible
> Option 1:
> enhance the value of spark.kryo.registrator to support a comma-separated 
> list of class names. This will be backward compatible and won't add new 
> parameters. 
> Option 2:
> to be more logical, add a new spark.kryo.registrators parameter, while keeping 
> the code handling the old one.
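
For illustration, the wrapper-registrator workaround described above could look like the sketch below (the two delegate registrators are made-up placeholders; only the KryoRegistrator hook and the spark.kryo.registrator setting are real):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class RegistratorA extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = kryo.register(classOf[Array[Byte]])
}

class RegistratorB extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = kryo.register(classOf[Array[String]])
}

// A single registrator that Spark instantiates by reflection, delegating to the others.
class CompositeRegistrator extends KryoRegistrator {
  private val delegates = Seq(new RegistratorA, new RegistratorB)
  override def registerClasses(kryo: Kryo): Unit = delegates.foreach(_.registerClasses(kryo))
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[CompositeRegistrator].getName)
{code}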



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10911) Executors should System.exit on clean shutdown

2015-12-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036613#comment-15036613
 ] 

Marcelo Vanzin commented on SPARK-10911:


Ok, that makes sense. But since the YARN bug is the source of the problem here, 
the commit message (and maybe even the code) should mention it.

> Executors should System.exit on clean shutdown
> --
>
> Key: SPARK-10911
> URL: https://issues.apache.org/jira/browse/SPARK-10911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>Priority: Minor
>
> Executors should call System.exit on clean shutdown to make sure all user 
> threads exit and the JVM shuts down.
> We ran into a case where an Executor was left around for days trying to 
> shut down because the user code was using a non-daemon thread pool and one of 
> those threads wasn't exiting.  We should force the JVM to go away with 
> System.exit.
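
As a self-contained illustration of that failure mode (plain JVM code, not Spark's): a single non-daemon user thread keeps the JVM alive after main() returns unless something calls System.exit:

{code}
object StuckJvmExample {
  def main(args: Array[String]): Unit = {
    val worker = new Thread(new Runnable {
      def run(): Unit = while (true) Thread.sleep(1000)  // user thread that never finishes
    })
    worker.setDaemon(false)  // non-daemon, like threads from a user-created thread pool
    worker.start()
    // main() returns here, but the JVM stays up because of the non-daemon thread.
    // Uncommenting the next line forces the shutdown this issue asks for:
    // System.exit(0)
  }
}
{code}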



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12104) collect() does not handle multiple columns with same name

2015-12-02 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-12104:
--

 Summary: collect() does not handle multiple columns with same name
 Key: SPARK-12104
 URL: https://issues.apache.org/jira/browse/SPARK-12104
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.0
Reporter: Hossein Falaki
Priority: Critical


This is a regression from Spark 1.5

Spark can produce DataFrames with identical column names (e.g., after left outer 
joins). In 1.5, when such a DataFrame was collected, we ended up with an R 
data.frame with modified column names:

{code}
> names(mySparkDF)
[1] "date"   "name"   "name" 

> names(collect(mySparkDF))
[1] "date"   "name"   "name.1" 
{code}

But in 1.6 only the first column is included in the collected R data.frame. I 
think SparkR should continue the old behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036656#comment-15036656
 ] 

Apache Spark commented on SPARK-12106:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/10110

> Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-12106
> URL: https://issues.apache.org/jira/browse/SPARK-12106
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Burak Yavuz
>
> This test is still transiently flaky, because async methods can finish out of 
> order, and mess with the naming in the queue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12106:


Assignee: (was: Apache Spark)

> Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-12106
> URL: https://issues.apache.org/jira/browse/SPARK-12106
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Burak Yavuz
>
> This test is still transiently flaky, because async methods can finish out of 
> order, and mess with the naming in the queue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12103:
--
Affects Version/s: (was: 1.4.0)
   (was: 1.3.0)
   (was: 1.2.0)
   (was: 1.1.0)
   (was: 1.0.0)
 Target Version/s:   (was: 1.4.2, 1.6.1)
 Priority: Minor  (was: Major)
  Component/s: Documentation

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic the data came from other than stashing the 
> topic into the value yourself.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-12-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12106:


Assignee: Apache Spark

> Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-12106
> URL: https://issues.apache.org/jira/browse/SPARK-12106
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> This test is still transiently flaky, because async methods can finish out of 
> order, and mess with the naming in the queue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12107) Update spark-ec2 versions

2015-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12107:
-
Target Version/s: 1.6.0

> Update spark-ec2 versions
> -
>
> Key: SPARK-12107
> URL: https://issues.apache.org/jira/browse/SPARK-12107
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> spark-ec2's version strings are out-of-date. The latest versions of Spark 
> need to be reflected in its internal version maps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator

2015-12-02 Thread Min Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036685#comment-15036685
 ] 

Min Qiu commented on SPARK-12085:
-

Looks like the BooleanSimplification rule in Spark 1.5 provides a general way 
to rewrite the predicate expression. It should cover my cases. Will test the 
query on Spark 1.5.

> The join condition hidden in DNF can't be pushed down to join operator 
> ---
>
> Key: SPARK-12085
> URL: https://issues.apache.org/jira/browse/SPARK-12085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Min Qiu
>
> TPC-H Q19:
> {quote}
> SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join 
> lineitem 
> WHERE ({color: red}p_partkey = l_partkey {color}
>AND p_brand = 'Brand#12' 
>AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') 
>AND l_quantity >= 1 AND l_quantity <= 1 + 10 
>AND p_size BETWEEN 1 AND 5 
>AND l_shipmode IN ('AIR', 'AIR REG') 
>AND l_shipinstruct = 'DELIVER IN PERSON') 
>OR ({color: red}p_partkey = l_partkey{color}
>AND p_brand = 'Brand#23' 
>AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') 
>AND l_quantity >= 10 AND l_quantity <= 10 + 10 
>AND p_size BETWEEN 1 AND 10 
>AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
> PERSON') 
>OR ({color: red}p_partkey = l_partkey{color} 
>AND p_brand = 'Brand#34' 
>AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') 
>AND l_quantity >= 20 AND l_quantity <= 20 + 10 
>AND p_size BETWEEN 1 AND 15 
>AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
> PERSON')
> {quote}
> The equality condition {color:red}p_partkey = l_partkey{color} matches the 
> join relations, but it cannot be recognized by the optimizer because it's hidden 
> in a disjunctive normal form.  As a result the entire WHERE clause ends up in a 
> filter operator on top of the join operator, whose join condition is 
> "None" in the optimized plan. Finally the query planner will apply a 
> prohibitively expensive Cartesian product in the physical plan, which causes an 
> OOM exception or very bad performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036710#comment-15036710
 ] 

Xiangrui Meng commented on SPARK-8517:
--

* I'm not sure whether the mathematical formulations are helpful or not. They 
might be useful for explaining the parameters, but it seems unnecessary for us to 
explain how the models work. I'm okay with copying the content.
* We don't have linear SVM in spark.ml.
* +1 on moving perceptron classifier to classification.

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12082) NettyBlockTransferSecuritySuite "security mismatch auth off on client" test is flaky

2015-12-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036738#comment-15036738
 ] 

Marcelo Vanzin commented on SPARK-12082:


If the build machines run on VMs, run multiple jobs, or have anything else that 
may lead to sporadic stalls or slowness, I wouldn't be surprised if timeouts 
turn out to be unreliable.

Our internal test machines are way more loaded than those used by Spark's 
Jenkins, and we run into a bunch of different timeouts that the Spark builds 
never see.

> NettyBlockTransferSecuritySuite "security mismatch auth off on client" test 
> is flaky
> 
>
> Key: SPARK-12082
> URL: https://issues.apache.org/jira/browse/SPARK-12082
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Josh Rosen
>  Labels: flaky-test
>
> The NettyBlockTransferSecuritySuite "security mismatch auth off on client" 
> test is flaky in Jenkins. Here's a link to a report that lists the latest 
> failures over the past week+:
> https://spark-tests.appspot.com/tests/org.apache.spark.network.netty.NettyBlockTransferSecuritySuite/security%20mismatch%20auth%20off%20on%20client#latest-failures
> In all of these failures, the test failed with the following exception:
> {code}
> Futures timed out after [1000 milliseconds]
>   java.util.concurrent.TimeoutException: Futures timed out after [1000 
> milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.ready(package.scala:86)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.fetchBlock(NettyBlockTransferSecuritySuite.scala:151)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.org$apache$spark$network$netty$NettyBlockTransferSecuritySuite$$testConnection(NettyBlockTransferSecuritySuite.scala:116)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply$mcV$sp(NettyBlockTransferSecuritySuite.scala:90)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 

[jira] [Commented] (SPARK-12082) NettyBlockTransferSecuritySuite "security mismatch auth off on client" test is flaky

2015-12-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036743#comment-15036743
 ] 

Josh Rosen commented on SPARK-12082:


For now, I'm going to try bumping the timeout to something higher, like 10 
seconds, and keep an eye on the test dashboard to see if that's a sufficient 
fix.
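
As a sketch of the kind of change described above (the real call site is NettyBlockTransferSecuritySuite.fetchBlock; the helper name here is made up), bumping the Await timeout looks roughly like this:

{code}
// Illustrative only: wait up to 10 seconds instead of the old 1000 milliseconds.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

def awaitFetchResult[T](result: Future[T]): T =
  Await.result(result, 10.seconds)
{code}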

> NettyBlockTransferSecuritySuite "security mismatch auth off on client" test 
> is flaky
> 
>
> Key: SPARK-12082
> URL: https://issues.apache.org/jira/browse/SPARK-12082
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Josh Rosen
>  Labels: flaky-test
>
> The NettyBlockTransferSecuritySuite "security mismatch auth off on client" 
> test is flaky in Jenkins. Here's a link to a report that lists the latest 
> failures over the past week+:
> https://spark-tests.appspot.com/tests/org.apache.spark.network.netty.NettyBlockTransferSecuritySuite/security%20mismatch%20auth%20off%20on%20client#latest-failures
> In all of these failures, the test failed with the following exception:
> {code}
> Futures timed out after [1000 milliseconds]
>   java.util.concurrent.TimeoutException: Futures timed out after [1000 
> milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.ready(package.scala:86)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.fetchBlock(NettyBlockTransferSecuritySuite.scala:151)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.org$apache$spark$network$netty$NettyBlockTransferSecuritySuite$$testConnection(NettyBlockTransferSecuritySuite.scala:116)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply$mcV$sp(NettyBlockTransferSecuritySuite.scala:90)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
>   at 

[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036807#comment-15036807
 ] 

Davies Liu commented on SPARK-12089:


This query will not generate huge records; each record should be less than 100B.
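
For reference, the failure mode in the quoted report below is plain int overflow when the buffer doubles. A minimal sketch (not the actual BufferHolder code; the names are made up) of the overflow and of one overflow-aware way to compute a new size:

{code}
// Illustrative only: doubling an int length past 2^30 wraps negative.
object GrowSketch {
  def unsafeGrow(length: Int): Int = length * 2   // mirrors `new byte[length * 2]`

  def saferNewSize(length: Int, neededExtra: Int): Int = {
    // Do the arithmetic in Long and clamp below the practical max array size.
    val target = math.max(length.toLong * 2L, length.toLong + neededExtra)
    math.min(target, Int.MaxValue - 8L).toInt
  }

  def main(args: Array[String]): Unit = {
    println(unsafeGrow(1200000000))        // negative: 2400000000 does not fit in an Int
    println(saferNewSize(1200000000, 64))  // 2147483639, still a valid array length
  }
}
{code}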

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Critical
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the spark code it looks like this is due to a integer overflow when 
> growing a buffer length. The offending line {{BufferHolder.java:36}} is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate to me that this buffer will never be able to hold more 
> then 2G worth of data. And likely will hold even less since any length > 
> 1073741824 will cause a integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting but it still seems 
> weird that we have a (rather low) upper limit on these buffers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11992) Several numbers in my spark shell (pyspark)

2015-12-02 Thread Alberto Bonsanto (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Bonsanto closed SPARK-11992.


> Several numbers in my spark shell (pyspark)
> --
>
> Key: SPARK-11992
> URL: https://issues.apache.org/jira/browse/SPARK-11992
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.2
> Environment: Linux Ubuntu 14.04 LTS
> Jupyter 
> Spark 1.5.2
>Reporter: Alberto Bonsanto
>Priority: Critical
>  Labels: newbie
>
> The problem is very weird. I am currently trying to fit several classifiers from 
> the mllib library (SVM, LogisticRegression, RandomForest, DecisionTree and 
> NaiveBayes) and compare their performance by evaluating their predictions on my 
> current validation data (the typical pipeline). The problem is that when I try to 
> fit any of them, my spark-shell console prints millions and millions of entries, 
> and after that the fitting process stops. You can see it 
> [here|http://i.imgur.com/mohLnwr.png]
> Some details:
> - My data has around 15M entries.
> - I use LabeledPoints to represent each entry, where the features are 
> SparseVectors with *104* features or dimensions.
> - I don't print much to the console: 
> [log4j.properties|https://gist.github.com/Bonsanto/c487624db805f56882b8]
> - The program runs locally on a computer with 16GB of RAM.
> I have already asked this on StackOverflow; you can see it here: [Crazy 
> print|http://stackoverflow.com/questions/33807347/pyspark-shell-outputs-several-numbers-instead-of-the-loading-arrow]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-12103:
---
Description: 
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers.)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to use one (or a few) Kafka streaming receivers to pool 
resources. I have 10+ topics and don't want to dedicate 10 cores to processing 
all of them. However, when reading the data produced by 
KafkaUtils.createStream, the DStream[(String,String)] does not properly insert 
the topic name into the tuple. The left key is always null, making it impossible 
to know which topic the data came from, other than stashing your key into the 
value. Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams: IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(i =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, consumerProperties,
    topics,
    StorageLevel.MEMORY_ONLY_SER))

val unioned: DStream[(String, String)] = ssc.union(streams)

unioned.flatMap(x => {

  val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE




  was:
(Note: yes, there is a Direct API that may be better, but it's not the easiest 
thing to get started with. The Kafka Receiver API still needs to work, 
especially for newcomers)

When creating a receiver stream using KafkaUtils, there is a valid use case 
where you would want to pool resources. I have 10+ topics and don't want to 
dedicate 10 cores to processing all of them. However, when reading the data 
procuced by KafkaUtils.createStream, the DStream[(String,String)] does not 
properly insert the topic name into the tuple. The left-key always null, making 
it impossible to know what topic that data came from other than stashing your 
key into the value.  Is there a way around that problem?

 CODE

val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
"topicE" -> 1, "topicF" -> 1, ...)

val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( i 
=>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc, consumerProperties,
topics,
StorageLevel.MEMORY_ONLY_SER))

val unioned :DStream[(String,String)] = ssc.union(streams)

unioned.flatMap(x => {

   val (key, value) = x
  // key is always null!
  // value has data from any one of my topics

  key match ... {
  ..
  }

}

 END CODE





> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1
>Reporter: Dan Dutrow
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036690#comment-15036690
 ] 

Xiangrui Meng commented on SPARK-8517:
--

* Agree that the focus of spark.ml should not only be pipelines but also 
DataFrames.
* +1 on reorganizing the spark.ml menu based on the goal.
* Users should be able to use individual algorithms under spark.ml. There are 
some missing features, but that is not part of this JIRA.

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-02 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036741#comment-15036741
 ] 

Dan Dutrow commented on SPARK-12103:


One possible way around this problem would be to stick the topic name into the 
message key. Unless the message key is supposed to be unique, this would allow 
you to retrieve the topic name from the data.
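
As a sketch of that workaround (assuming the old Kafka 0.8 Scala producer API; the broker address and topic below are placeholders), the producer would embed the topic name as the message key so that KafkaUtils.createStream surfaces it as the left element of the tuple:

{code}
// Illustrative producer-side workaround: put the topic name into the message key.
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "broker1:9092")                // placeholder broker
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("key.serializer.class", "kafka.serializer.StringEncoder")

val producer = new Producer[String, String](new ProducerConfig(props))
val topic = "topicA"
// The receiver-side DStream[(String, String)] then sees (topic, payload) instead of (null, payload).
producer.send(new KeyedMessage[String, String](topic, topic, "some payload"))
{code}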

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1
>Reporter: Dan Dutrow
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12104) collect() does not handle multiple columns with same name

2015-12-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036791#comment-15036791
 ] 

Shivaram Venkataraman commented on SPARK-12104:
---

Any ideas what caused this?

> collect() does not handle multiple columns with same name
> -
>
> Key: SPARK-12104
> URL: https://issues.apache.org/jira/browse/SPARK-12104
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Hossein Falaki
>Priority: Critical
>
> This is a regression from Spark 1.5.
> Spark can produce DataFrames with identical column names (e.g., after left outer 
> joins). In 1.5, when such a DataFrame was collected, we ended up with an R 
> data.frame with modified column names:
> {code}
> > names(mySparkDF)
> [1] "date"   "name"   "name" 
> > names(collect(mySparkDF))
> [1] "date"   "name"   "name.1" 
> {code}
> But in 1.6 only the first column is included in the collected R data.frame. I 
> think SparkR should continue the old behavior.
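
For illustration, the 1.5-era behaviour described above amounts to suffixing repeated names the way R's make.unique does. A minimal Scala sketch (not the actual SparkR code; the function name is made up):

{code}
// Illustrative only: Seq("date", "name", "name") ==> Seq("date", "name", "name.1")
def dedupeColumnNames(names: Seq[String]): Seq[String] = {
  val seen = scala.collection.mutable.Map.empty[String, Int]
  names.map { n =>
    val count = seen.getOrElse(n, 0)
    seen(n) = count + 1
    if (count == 0) n else s"$n.$count"
  }
}
{code}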



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12110:
---

 Summary: spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  
Exception: ("You must build Spark with Hive 
 Key: SPARK-12110
 URL: https://issues.apache.org/jira/browse/SPARK-12110
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark, SQL
Affects Versions: 1.5.1
 Environment: cluster created using 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
Reporter: Andrew Davidson


I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I cannot run the tokenizer sample code. Is there a 
workaround?

Kind regards

Andy

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
658 raise Exception("You must build Spark with Hive. "
659 "Export 'SPARK_HIVE=true' and run "
--> 660 "build/sbt assembly", e)
661 
662 def _get_hive_ctx(self):

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError('An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))




http://spark.apache.org/docs/latest/ml-features.html#tokenizer

from pyspark.ml.feature import Tokenizer, RegexTokenizer

sentenceDataFrame = sqlContext.createDataFrame([
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsDataFrame = tokenizer.transform(sentenceDataFrame)
for words_label in wordsDataFrame.select("words", "label").take(3):
  print(words_label)

---
Py4JJavaError Traceback (most recent call last)
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
654 if not hasattr(self, '_scala_HiveContext'):
--> 655 self._scala_HiveContext = self._get_hive_ctx()
656 return self._scala_HiveContext

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
662 def _get_hive_ctx(self):
--> 663 return self._jvm.HiveContext(self._jsc.sc())
664 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
700 return_value = get_return_value(answer, self._gateway_client, 
None,
--> 701 self._fqn)
702 

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.io.IOException: Filesystem closed
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at 
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554)
at 
org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at 

[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-02 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036809#comment-15036809
 ] 

shane knapp commented on SPARK-11255:
-

OK, [~shivaram] and I got together and whipped up a test build on our staging 
node with R 3.1.1, and it looks like we're good!

https://amplab.cs.berkeley.edu/jenkins/view/AMPLab%20Infra/job/SparkR-testing/2/

[jenkins@amp-jenkins-staging-01 ~]# R --version | head -1
R version 3.1.1 (2014-07-10) -- "Sock it to Me"

I'll need to schedule some Jenkins-wide downtime (before EOY) to remove all 
traces of the old R, install 3.1.1, and recompile any additional packages.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently there are a few R changes that can go undetected, since the Jenkins 
> test build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2015-12-02 Thread Dean Wampler (JIRA)
Dean Wampler created SPARK-12105:


 Summary: Add a DataFrame.show() with argument for output 
PrintStream
 Key: SPARK-12105
 URL: https://issues.apache.org/jira/browse/SPARK-12105
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.2
Reporter: Dean Wampler
Priority: Minor


It would be nice to send the output of DataFrame.show(...) to a different 
output stream than stdout, including just capturing the string itself. This is 
useful, e.g., for testing. Actually, it would be sufficient and perhaps better 
to just make DataFrame.showString a public method.
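
Until something like that exists, one possible workaround (a sketch only, relying on the fact that show() prints through Scala's Console) is to redirect stdout around the call and capture the rendered table as a string:

{code}
// Sketch only: capture what df.show() prints by temporarily swapping Console.out.
import java.io.{ByteArrayOutputStream, PrintStream}
import org.apache.spark.sql.DataFrame

def showToString(df: DataFrame, numRows: Int = 20): String = {
  val buffer = new ByteArrayOutputStream()
  Console.withOut(new PrintStream(buffer, true, "UTF-8")) {
    df.show(numRows)
  }
  buffer.toString("UTF-8")
}
{code}

This is enough for tests that want to assert on the rendered output, though a public showString would of course be cleaner.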



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join

2015-12-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036683#comment-15036683
 ] 

Michael Armbrust commented on SPARK-12066:
--

Can you reproduce this on 1.6-rc1?

> spark sql  throw java.lang.ArrayIndexOutOfBoundsException when use table.* 
> with join 
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.2
> Environment: linux 
>Reporter: Ricky Yang
>Priority: Blocker
>
> Spark SQL throws java.lang.ArrayIndexOutOfBoundsException when I use the 
> following SQL on Spark standalone or YARN.
> The SQL:
> select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> But the result is correct when not selecting *, as follows:
> select ta.sale_dt 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
> Error log:
>   
> 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ] 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException 
> Driver stacktrace: 
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at scala.Option.foreach(Option.scala:236) 
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>  
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) 
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215) 
> at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207) 
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
>  
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) 
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) 
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409) 
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425) 
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> 

[jira] [Updated] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-12-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12106:
---
Labels: flaky-test  (was: )

> Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-12106
> URL: https://issues.apache.org/jira/browse/SPARK-12106
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Burak Yavuz
>  Labels: flaky-test
>
> This test is still transiently flaky because async methods can finish out of 
> order and mess with the naming in the queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036718#comment-15036718
 ] 

Xiangrui Meng commented on SPARK-8517:
--

[~timhunter] I agree with most of your points. I'd recommend the following 
steps:

1. Reorganize the content of the spark.ml guide based on the goals, but do not 
introduce new content if possible.
2. Create JIRAs for the remaining tasks, which could be done in parallel, including:
  * spark.ml branding
  * move model selection / cross-validation to tuning
  * enhance the guide for individual algorithms, etc.

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


