[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Selin updated SPARK-12089: --- Description: When running a large Spark SQL query including multiple joins I see tasks failing with the following trace:
{code}
java.lang.NegativeArraySizeException
	at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
	at org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
	at org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
From the Spark code it looks like this is due to an integer overflow when growing a buffer length. The offending line {{BufferHolder.java:36}} is the following in the version I'm running:
{code}
final byte[] tmp = new byte[length * 2];
{code}
This seems to indicate to me that this buffer will never be able to hold more than 2G worth of data.
And likely will hold even less, since any length > 1073741824 will cause an integer overflow and turn the new buffer size negative. I hope I'm simply missing some critical config setting, but it still seems weird that we have a (rather low) upper limit on these buffers.
[jira] [Comment Edited] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036015#comment-15036015 ] Erik Selin edited comment on SPARK-12089 at 12/2/15 4:15 PM: - I can make that change if it is that easy. I'm just wondering if that's really enough? What would happen when we need a buffer to hold more than Integer.MAX_VALUE? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036015#comment-15036015 ] Erik Selin edited comment on SPARK-12089 at 12/2/15 4:07 PM: - I can make that change if it is that easy. I'm just wondering if that's really enough? What would happen if we actually need a buffer to hold more than Integer.MAX_VALUE?
[jira] [Assigned] (SPARK-12096) remove the old constraint in word2vec
[ https://issues.apache.org/jira/browse/SPARK-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12096: Assignee: Apache Spark
> remove the old constraint in word2vec
> -
>
> Key: SPARK-12096
> URL: https://issues.apache.org/jira/browse/SPARK-12096
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.2
> Reporter: yuhao yang
> Assignee: Apache Spark
> Priority: Minor
>
> word2vec can now handle a much bigger vocabulary.
> The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be removed.
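The constraint above guards the size of the word-vector matrix against 32-bit overflow. A small Java illustration of why the {{toLong}} promotion in that check matters (the class and method names here are illustrative, not Spark's actual code):

```java
// Illustration of the kind of size guard discussed above. The point of
// vocabSize.toLong in the original Scala check is to force 64-bit
// arithmetic: with plain ints, vocabSize * vectorSize can silently
// overflow before the comparison happens. Names are hypothetical.
public class Word2VecGuard {
    static boolean fitsOldConstraint(int vocabSize, int vectorSize) {
        // Promote to long BEFORE multiplying, mirroring vocabSize.toLong.
        long cells = (long) vocabSize * (long) vectorSize;
        return cells < Integer.MAX_VALUE / 8L;
    }

    public static void main(String[] args) {
        // 10M words x 300 dims = 3e9 cells: well over the old limit.
        System.out.println(fitsOldConstraint(10_000_000, 300)); // false
        System.out.println(fitsOldConstraint(100_000, 300));    // true
    }
}
```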
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036015#comment-15036015 ] Erik Selin commented on SPARK-12089: I can make that change if it is that easy. I'm just wondering if that's really enough? What would happen if we actually need this buffer to hold more than Integer.MAX_VALUE?
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036004#comment-15036004 ] Wenchen Fan commented on SPARK-12089: - yea, I think we should be more conservative when growing the buffer array: maybe we can check the `length`, and if `length > Integer.MAX_VALUE / 2` just use `Integer.MAX_VALUE` as the new array size instead of `length * 2`. cc [~davies]
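The clamp-then-double rule suggested in this thread can be sketched as follows. This is an illustration only, using a free-standing helper method, not the actual Spark patch to {{BufferHolder}}:

```java
// Sketch of the conservative growth rule discussed above: double the
// buffer, but clamp the new size at Integer.MAX_VALUE instead of letting
// (length * 2) overflow into a negative number. Illustrative code, not
// the actual Spark fix.
public class GrowthSketch {
    static int grownSize(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        // length * 2 overflows once length > Integer.MAX_VALUE / 2,
        // so clamp before doubling.
        if (length > Integer.MAX_VALUE / 2) {
            return Integer.MAX_VALUE;
        }
        return length * 2;
    }

    public static void main(String[] args) {
        System.out.println(grownSize(64));         // 128
        System.out.println(grownSize(1073741824)); // clamped to 2147483647
    }
}
```

Note that this only removes the negative-size crash; as the reporter points out, a single row still cannot exceed `Integer.MAX_VALUE` bytes, since Java arrays are indexed by int.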
[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035873#comment-15035873 ] Christoph Pirkl commented on SPARK-10969: - While this commit is useful, it does not fix this issue. In order to fix it, it would be best to introduce a serializable parameter object that contains AWS credentials for Kinesis, DynamoDB and CloudWatch (and maybe other values). This would make it easier to add more parameters later.
> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.5.1
> Reporter: Christoph Pirkl
> Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a different AWS account, the user usually has minimal rights (i.e. only read from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The additional credentials could then be passed to the constructor of {{KinesisClientLibConfiguration}} or the method {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.
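A minimal sketch of the serializable parameter object proposed in this comment. All names here are hypothetical; this is neither Spark nor KCL API, just an illustration of the shape of such a holder with per-service credentials falling back to the Kinesis pair:

```java
import java.io.Serializable;

// Sketch of the proposed serializable parameter object: one holder
// carrying separate access-key/secret pairs for Kinesis, DynamoDB and
// CloudWatch, so a stream in another AWS account can be read with
// minimal rights. Hypothetical names, not an actual Spark/KCL class.
public class KinesisStreamConfig implements Serializable {
    public static final class Credentials implements Serializable {
        public final String accessKeyId;
        public final String secretKey;
        public Credentials(String accessKeyId, String secretKey) {
            this.accessKeyId = accessKeyId;
            this.secretKey = secretKey;
        }
    }

    public final Credentials kinesis;
    public final Credentials dynamoDb;   // falls back to kinesis if null
    public final Credentials cloudWatch; // falls back to kinesis if null

    public KinesisStreamConfig(Credentials kinesis,
                               Credentials dynamoDb,
                               Credentials cloudWatch) {
        this.kinesis = kinesis;
        this.dynamoDb = dynamoDb != null ? dynamoDb : kinesis;
        this.cloudWatch = cloudWatch != null ? cloudWatch : kinesis;
    }
}
```

Bundling the credentials this way keeps the {{createStream()}} signature stable when further per-service parameters are added later, which is the extensibility argument made above.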
[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035963#comment-15035963 ] Brian London commented on SPARK-10969: -- I'm not sure why it needs to be an object as opposed to two strings, or is the issue that this change only adds custom credentials to Kinesis and not DynamoDB and CloudWatch? There are already the `fs.s3.awsAccessKeyId` and `fs.s3.awsSecretAccessKey` parameters that are used for the `.saveAsTextFile` RDD method. Perhaps something like that could be used for the other services as well.
[jira] [Comment Edited] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035963#comment-15035963 ] Brian London edited comment on SPARK-10969 at 12/2/15 3:31 PM: --- I'm not sure why it needs to be an object as opposed to two strings, or is the issue that this change only adds custom credentials to Kinesis and not DynamoDB and CloudWatch? There are already the `fs.s3.awsAccessKeyId` and `fs.s3.awsSecretAccessKey` parameters that are used for the `.saveAsTextFile` RDD method. Perhaps something like that could be used for the other services as well.
[jira] [Created] (SPARK-12098) Cross validator with multi-arm bandit search
Xusen Yin created SPARK-12098: - Summary: Cross validator with multi-arm bandit search Key: SPARK-12098 URL: https://issues.apache.org/jira/browse/SPARK-12098 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xusen Yin

Classic cross-validation requires all inner classifiers to run for a fixed number of iterations, or until convergence. This is costly, especially in the massive-data scenario. According to the paper Non-stochastic Best Arm Identification and Hyperparameter Optimization (http://arxiv.org/pdf/1502.07943v1.pdf), multi-armed bandit search is a promising way to reduce the total number of iterations spent in cross-validation. Multi-armed bandit search for cross-validation (bandit search for short) requires warm-start support in the ML algorithms and fine-grained control of the inner behavior of the cross validator. Since there are a number of bandit search algorithms for finding the best parameter set, we intend to provide only a few of them in the beginning, to reduce the test/perf-test work and make it more stable.
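The bandit idea above can be illustrated with a toy successive-halving loop in the spirit of the cited paper: train all candidate configurations for a few iterations, drop the worst half, and continue with the survivors instead of running every configuration to convergence. This is a sketch with a stand-in score function, not Spark ML code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy successive-halving sketch: each "arm" is a hyperparameter setting
// whose quality can be probed after a given number of iterations. Rather
// than running every arm to convergence, we periodically discard the
// worse half. The score function is a stand-in, not a real ML algorithm.
public class SuccessiveHalvingSketch {
    interface Arm { double scoreAfter(int iterations); }

    static List<Integer> survivors(List<Arm> arms, int rounds, int itersPerRound) {
        List<Integer> alive = new ArrayList<>();
        for (int i = 0; i < arms.size(); i++) alive.add(i);
        int iters = 0;
        for (int r = 0; r < rounds && alive.size() > 1; r++) {
            iters += itersPerRound;
            final int it = iters;
            // Higher score = better; keep the best half of the arms.
            alive.sort(Comparator.comparingDouble(
                    (Integer i) -> arms.get(i).scoreAfter(it)).reversed());
            alive = new ArrayList<>(alive.subList(0, (alive.size() + 1) / 2));
        }
        return alive;
    }

    public static void main(String[] args) {
        // Four hypothetical settings with different accuracy plateaus.
        double[] plateau = {0.6, 0.7, 0.9, 0.8};
        List<Arm> arms = new ArrayList<>();
        for (double p : plateau) arms.add(it -> p * (1.0 - Math.exp(-it / 10.0)));
        System.out.println(survivors(arms, 3, 10)); // prints [2]
    }
}
```

The total iteration budget here is far below "all arms to convergence", which is the saving the issue is after; the real algorithms in the paper also handle non-monotone learning curves, which this sketch ignores.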
[jira] [Assigned] (SPARK-12096) remove the old constraint in word2vec
[ https://issues.apache.org/jira/browse/SPARK-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12096: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-12096) remove the old constraint in word2vec
[ https://issues.apache.org/jira/browse/SPARK-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035864#comment-15035864 ] Apache Spark commented on SPARK-12096: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/10103
[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Selin updated SPARK-12089: --- Description: When running a large Spark SQL query including multiple joins I see tasks failing with the following trace:
{code}
java.lang.NegativeArraySizeException
	at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
	at org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
	at org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
From the Spark code it looks like this is due to an integer overflow when growing a buffer length. The offending line {{BufferHolder.java:36}} is the following in the version I'm running:
{code}
final byte[] tmp = new byte[length * 2];
{code}
This seems to indicate to me that this buffer will never be able to hold more than 2G worth of data. And likely will hold even less, since any length > 1073741824 will cause an integer overflow and turn negative once we double it. I'm still digging down to try to pinpoint what is actually responsible for managing how big this should be grown. I figure we cannot simply add some check here to keep it from going negative, since arguably we do need the buffer to grow this big?
[jira] [Created] (SPARK-12097) How to do a cached, batched JDBC-lookup in Spark Streaming?
Christian Kurz created SPARK-12097: -- Summary: How to do a cached, batched JDBC-lookup in Spark Streaming? Key: SPARK-12097 URL: https://issues.apache.org/jira/browse/SPARK-12097 Project: Spark Issue Type: Brainstorming Components: Streaming Reporter: Christian Kurz h3. Use-case I need to enrich incoming Kafka data with data from a lookup table (or query) on a relational database. Lookup data is changing slowly over time (So caching is okay for a certain retention time). Lookup data is potentially huge (So loading all data upfront is not an option). h3. Problem The overall design idea is to implement a cached and batched JDBC lookup. That is, for any lookup keys which are missing from the lookup cache, a JDBC lookup is done to retrieve the missing lookup data. JDBC lookups are rather expensive (connection overhead, number of round-trips) and therefore must be done in batches. E.g. one JDBC lookup per 100 missing keys. So the high-level logic might look something like this: # For every Kafka RDD we extract all lookup keys # For all lookup keys we check whether the lookup data is already available in cache and whether this cached information has not expired yet. # For any lookup keys not found in cache (or expired), we send batched prepared JDBC Statements to the database to fetch the missing lookup data: {{SELECT c1, c2, c3 FROM ... WHERE k1 in (?,?,?,...)}} to minimize the number of JDBC round-trips. # At this point we have up-to-date lookup data for all lookup keys and can perform the actual lookup operation. Does this approach make sense on Spark? Would Spark State DStreams be the right way to go? Or other design approaches? Assuming Spark State DStreams are the right direction, the low-level question is how to do the batching?
Would this particular signature of DStream.updateStateByKey ( iterator-> iterator): {code:borderStyle=solid} def updateStateByKey[S: ClassTag]( updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean, initialRDD: RDD[(K, S)] ) {code} be the right way to batch multiple incoming keys into a single JDBC-lookup query? Would the new {{DStream.trackStateByKey()}} be a better approach? The second, more high-level question: is there a way to chain multiple state operations on the same state object? Going with the above design approach the entire lookup logic would be handcrafted into some java/scala/python {{updateFunc}}. This function would go over all incoming keys, check which ones are missing from cache, batch the missing ones, run the JDBC queries and union the returned lookup data with the existing cache from the State object. The fact that all of this must be handcrafted into a single function seems to be caused by the fact that Spark State processing logic on a high-level works like this: {code:borderStyle=solid} input: prevStateRdd, inputRDD output: updateStateFunc( prevStateRdd, inputRdd ) {code} So only a single updateStateFunc operating on prevStateRdd and inputRdd in one go. Once done there is no way to further refine the State as part of the current micro batch. The multi-step processing required here sounds like a typical use-case for a DStream: apply multiple operations one after the other on some incoming data. So I wonder whether there is a way to extend the concept of state processing (maybe it already has been extended?)
to do something like: {code:borderStyle=solid} *input: prevStateRdd, inputRdd* missingKeys = inputRdd.filter( ) foundKeys = inputRdd.filter( ) newLookupData = lookupKeysUsingJdbcDataFrameRead( missingKeys.collect() ) newStateRdd = newLookupData.union( foundKeys).union( prevStateRdd ) *output: newStateRdd* {code} This would nicely leverage all the power and richness of Spark. The only missing bit - and the reason why this approach does not work today (based on my naive understanding of Spark) - is that {{newStateRdd}} cannot be declared to be the {{prevStateRdd}} of the next micro batch. If Spark had a way of declaring an RDD (or DStream) to be the parent for the next batch run, even complex (chained) state operations would be easy to describe and would not require hand-written Java/Python/Scala updateFunctions. Thanks a lot for taking the time to read all of this!!! Any thoughts/pointers are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
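The numbered steps in the report above (check the cache, batch the misses into a single `IN (?,?,...)` query, merge the results back) can be sketched independently of Spark. Below is a toy illustration in Python using sqlite3 as a stand-in for the JDBC source; the table/column names `lookup`, `k1`, `c1` follow the example query in the report, and the TTL cache and batch size are assumptions, not any Spark API:

```python
import sqlite3
import time

BATCH_SIZE = 100     # e.g. one lookup query per 100 missing keys, as suggested above
TTL_SECONDS = 300    # assumed cache retention time

cache = {}           # key -> (value, fetched_at)

def fresh(entry, now):
    """True if a cache entry exists and has not expired."""
    return entry is not None and now - entry[1] < TTL_SECONDS

def lookup(conn, keys):
    """Return {key: value}, hitting the database only for missing/expired keys."""
    now = time.time()
    missing = [k for k in keys if not fresh(cache.get(k), now)]
    # Batched prepared statements: one IN-list query per BATCH_SIZE missing keys.
    for i in range(0, len(missing), BATCH_SIZE):
        batch = missing[i:i + BATCH_SIZE]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT k1, c1 FROM lookup WHERE k1 IN ({placeholders})", batch)
        for k, v in rows:
            cache[k] = (v, now)
    return {k: cache[k][0] for k in keys if k in cache}

# Tiny demo against an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (k1 TEXT PRIMARY KEY, c1 TEXT)")
conn.executemany("INSERT INTO lookup VALUES (?, ?)",
                 [("a", "apple"), ("b", "banana")])
print(lookup(conn, ["a", "b", "z"]))  # {'a': 'apple', 'b': 'banana'}; 'z' not found
```

In a streaming job this `lookup` body is roughly what the hand-written `updateFunc` discussed above would have to do per micro batch.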
[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036337#comment-15036337 ] holdenk commented on SPARK-12040: - Working on this :) > Add toJson/fromJson to Vector/Vectors for PySpark > - > > Key: SPARK-12040 > URL: https://issues.apache.org/jira/browse/SPARK-12040 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Yanbo Liang >Priority: Trivial > Labels: starter > > Add toJson/fromJson to Vector/Vectors for PySpark, please refer to the Scala one, > SPARK-11766.
[jira] [Resolved] (SPARK-12094) Better format for query plan tree string
[ https://issues.apache.org/jira/browse/SPARK-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12094. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 10099 [https://github.com/apache/spark/pull/10099] > Better format for query plan tree string > > > Key: SPARK-12094 > URL: https://issues.apache.org/jira/browse/SPARK-12094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.7.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.6.0 > > > When examining plans of complex queries with multiple joins, a pain point of > mine is that it's hard to immediately see the sibling node of a specific > query plan node. For example: > {noformat} > TakeOrderedAndProject ... > ConvertToSafe > Project ... >TungstenAggregate ... > TungstenExchange ... > TungstenAggregate ... > Project ... >BroadcastHashJoin ... > Project ... > BroadcastHashJoin ... > Project ... >BroadcastHashJoin ... > Scan ... > Filter ... > Scan ... > Scan ... > Project ... > Filter ... > Scan ... > {noformat} > It would be better to have tree lines to indicate relationships between plan > nodes.
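The kind of tree-line rendering this issue asks for is easy to sketch. Below is a minimal, hypothetical illustration (the `:- `/`+- ` connectors are one plausible convention; the `(name, children)` tuples are a made-up node structure, not Spark's TreeNode API or the code from the pull request):

```python
def render(node, prefix="", is_last=True, top=True):
    """Render a plan tree with connector lines so siblings are visible at a glance."""
    name, children = node
    # The root gets no connector; every other node gets ':- ' (has a later
    # sibling) or '+- ' (last child), indented under its parent's prefix.
    lines = [name if top else prefix + ("+- " if is_last else ":- ") + name]
    child_prefix = prefix if top else prefix + ("   " if is_last else ":  ")
    for i, child in enumerate(children):
        lines += render(child, child_prefix, i == len(children) - 1, top=False)
    return lines

plan = ("Project", [
    ("BroadcastHashJoin", [
        ("Scan", []),
        ("Filter", [("Scan", [])]),
    ]),
])
print("\n".join(render(plan)))
# Project
# +- BroadcastHashJoin
#    :- Scan
#    +- Filter
#       +- Scan
```

Compared with the indentation-only dump quoted in the issue, each node's parent and siblings can be read straight off the connector column.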
[jira] [Updated] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12089: --- Priority: Critical (was: Major) > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical > > When running a large spark sql query including multiple joins I see tasks > failing with the following trace: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > 
From the spark code it looks like this is due to an integer overflow when > growing a buffer length. The offending line {{BufferHolder.java:36}} is the > following in the version I'm running: > {code} > final byte[] tmp = new byte[length * 2]; > {code} > This seems to indicate to me that this buffer will never be able to hold more > than 2G worth of data. And likely will hold even less since any length > > 1073741824 will cause an integer overflow and turn the new buffer size > negative. > I hope I'm simply missing some critical config setting but it still seems > weird that we have a (rather low) upper limit on these buffers.
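To make the overflow concrete: any `length` above 2^30 (1073741824) wraps negative when doubled in 32-bit arithmetic, exactly as the report describes for `new byte[length * 2]`. A small sketch, simulating Java's int wraparound in Python; the `grown_length` helper is a hypothetical fix shape (clamp instead of wrap), not Spark's actual code, and `MAX_ARRAY_SIZE` mirrors the common JVM array-size limit of Integer.MAX_VALUE - 8:

```python
import ctypes

def java_int(x):
    """Simulate Java 32-bit signed int wraparound."""
    return ctypes.c_int32(x).value

# Any buffer length above 2**30 goes negative when doubled, as in
# `new byte[length * 2]` at BufferHolder.java:36.
length = 1_200_000_000           # > 1073741824 (2**30)
assert java_int(length * 2) < 0  # NegativeArraySizeException territory

# Hypothetical fix sketch: clamp growth at the maximum array size instead
# of letting it wrap. (Assumption for illustration, not Spark's code.)
MAX_ARRAY_SIZE = 2**31 - 8       # JVM arrays top out just below Integer.MAX_VALUE

def grown_length(length, needed):
    doubled = min(length * 2, MAX_ARRAY_SIZE)  # Python ints don't wrap, so min() works
    if needed > MAX_ARRAY_SIZE:
        raise OverflowError("row does not fit in a single buffer")
    return max(doubled, needed)

print(grown_length(1_200_000_000, 1_300_000_000))  # clamped to 2147483640
```

The clamp removes the negative-size crash, though as the report notes a single buffer would still cap out near 2G.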
[jira] [Commented] (SPARK-12097) How to do a cached, batched JDBC-lookup in Spark Streaming?
[ https://issues.apache.org/jira/browse/SPARK-12097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036116#comment-15036116 ] Sean Owen commented on SPARK-12097: --- Normally I'd say we don't use JIRA for discussion (i.e. we don't use the Brainstorming type and can't delete it), but since the writeup here is good, and I have some back-story on the use case here which makes me believe it _could_ lead to a change, let's leave it for the moment. Why not simply query once per batch for all the rows you need and cache them? This need not involve anything specific to Spark. {{updateStateByKey}} is more for when you want Spark to maintain some persistent data, but you are already maintaining it (in your RDBMS). It's unnecessarily complex here. > How to do a cached, batched JDBC-lookup in Spark Streaming? > --- > > Key: SPARK-12097 > URL: https://issues.apache.org/jira/browse/SPARK-12097 > Project: Spark > Issue Type: Brainstorming > Components: Streaming >Reporter: Christian Kurz
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036194#comment-15036194 ] Davies Liu commented on SPARK-12089: Is it possible that you have a record larger than 1G? I don't think so, or many places would break. Does this task always fail or not? If not, it means the length gets corrupted somewhere; that's the root cause. Can you reproduce this? > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical
[jira] [Created] (SPARK-12099) Standalone and Mesos Should use OnOutOfMemoryError handlers
Imran Rashid created SPARK-12099: Summary: Standalone and Mesos Should use OnOutOfMemoryError handlers Key: SPARK-12099 URL: https://issues.apache.org/jira/browse/SPARK-12099 Project: Spark Issue Type: Improvement Components: Deploy, Mesos Affects Versions: 1.6.0 Reporter: Imran Rashid cc [~andrewor14] [~tnachen] On SPARK-11801, we've been discussing the use of {{OnOutOfMemoryError}} to terminate the jvm in yarn mode. There seems to be consensus that this is indeed the right thing to do. I assume that the other cluster managers should also be doing the same thing. Though maybe there is a good reason for not including it under standalone & mesos mode (or perhaps this is already happening via some other mechanism I'm not seeing). In any case, I thought it was worth drawing your attention to it; I didn't see this discussed in any previous issue. (Note that there are currently some drawbacks to using {{OnOutOfMemoryError}}, in that you get some confusing msgs, but hopefully SPARK-11801 will address that.)
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036247#comment-15036247 ] Erik Selin commented on SPARK-12089: There shouldn't be a single record larger than 1G, no. But I'm doing a group by month, so I was guessing that a large amount of pre-aggregated data might end up in the same buffer? I have tasks failing and later succeeding, and then there are others that seem to just kill the job. I'm not deeply familiar with this area of spark but I guess that does mean we might be failing to reclaim parts of the buffer for later use? I can reproduce, but it takes me ~4 hours to get to the failing stage :) > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036295#comment-15036295 ] Davies Liu commented on SPARK-12089: [~tyro89] Are you building a large Array using group by? What does the query look like? > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical
[jira] [Created] (SPARK-12100) bug in spark/python/pyspark/rdd.py portable_hash()
Andrew Davidson created SPARK-12100: --- Summary: bug in spark/python/pyspark/rdd.py portable_hash() Key: SPARK-12100 URL: https://issues.apache.org/jira/browse/SPARK-12100 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.1 Reporter: Andrew Davidson Priority: Minor I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED? Should I file a bug? Kind regards Andy details http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract Example from documentation does not work out of the box: subtract(other, numPartitions=None) Return each value in self that is not contained in other. >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)]) >>> y = sc.parallelize([("a", 3), ("c", None)]) >>> sorted(x.subtract(y).collect()) [('a', 1), ('b', 4), ('b', 5)] It raises: if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ: raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED") The following script fixes the problem: sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"` sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done This is how I am starting spark: export PYSPARK_PYTHON=python3.4 export PYSPARK_DRIVER_PYTHON=python3.4 export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN" $SPARK_ROOT/bin/pyspark \ --master $MASTER_URL \ --total-executor-cores $numCores \ --driver-memory 2G \ --executor-memory 2G \
$extraPkgs \ $* see email thread "possible bug spark/python/pyspark/rdd.py portable_hash()" on user@spark for more info
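The guard and the workaround above can be reproduced without a cluster. A minimal sketch: `portable_hash_guard` mirrors the check quoted from rdd.py (using `sys.version_info` rather than the string compare), and the subprocess demo shows why a fixed seed matters; `123` is the arbitrary seed value from the script above:

```python
import os
import subprocess
import sys

# On Python >= 3.3, string hashing is salted per-process unless PYTHONHASHSEED
# is set, so hash-based partitioning would differ between driver and executors.
# This mirrors the guard quoted from pyspark's rdd.py.
def portable_hash_guard():
    if sys.version_info >= (3, 3) and "PYTHONHASHSEED" not in os.environ:
        raise Exception(
            "Randomness of hash of string should be disabled via PYTHONHASHSEED")

# With a fixed seed, hash("a") is identical across separate interpreter
# processes, which is what consistent partitioning requires.
env = dict(os.environ, PYTHONHASHSEED="123")
code = 'print(hash("a"))'
h1 = subprocess.run([sys.executable, "-c", code], env=env,
                    capture_output=True, text=True).stdout
h2 = subprocess.run([sys.executable, "-c", code], env=env,
                    capture_output=True, text=True).stdout
print(h1 == h2)  # True: deterministic across processes
```

Without the environment variable, the two subprocesses would generally print different hashes, which is exactly the inconsistency the rdd.py guard protects against.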
[jira] [Assigned] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12040: Assignee: Apache Spark > Add toJson/fromJson to Vector/Vectors for PySpark > - > > Key: SPARK-12040 > URL: https://issues.apache.org/jira/browse/SPARK-12040
[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036338#comment-15036338 ] Apache Spark commented on SPARK-12040: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/10106 > Add toJson/fromJson to Vector/Vectors for PySpark > - > > Key: SPARK-12040 > URL: https://issues.apache.org/jira/browse/SPARK-12040
[jira] [Assigned] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12040: Assignee: (was: Apache Spark) > Add toJson/fromJson to Vector/Vectors for PySpark > - > > Key: SPARK-12040 > URL: https://issues.apache.org/jira/browse/SPARK-12040
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036280#comment-15036280 ] Davies Liu commented on SPARK-12089: Could you turn on debug log, and paste the java source code of the SpecificUnsafeProjection (generated unsafe projection) here? > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical > > When running a large spark sql query including multiple joins I see tasks > failing with the following trace: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > From the Spark code it looks like this is due to an integer overflow when > growing a buffer length. The offending line {{BufferHolder.java:36}} is the > following in the version I'm running: > {code} > final byte[] tmp = new byte[length * 2]; > {code} > This seems to indicate to me that this buffer will never be able to hold more > than 2G worth of data. And it will likely hold even less, since any length > > 1073741824 will cause an integer overflow and turn the new buffer size > negative. > I hope I'm simply missing some critical config setting, but it still seems > weird that we have a (rather low) upper limit on these buffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
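The arithmetic behind the report can be sketched directly. The safer growth rule below is an assumption for illustration only, not Spark's actual patch for this issue:

```java
// Sketch of the failure: doubling an int length past 2^30 wraps negative,
// which is what "new byte[length * 2]" then throws on.
public class BufferGrowthSketch {
    // A hedged alternative growth rule (an assumption, not Spark's fix):
    // compute the target size in long arithmetic and clamp it at
    // Integer.MAX_VALUE instead of letting the multiplication wrap.
    static int safeGrow(int current, int neededExtra) {
        long target = Math.max((long) current * 2, (long) current + neededExtra);
        return (int) Math.min(target, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        int length = 1 << 30;            // 1073741824 bytes
        int doubled = length * 2;        // wraps to Integer.MIN_VALUE
        System.out.println(doubled < 0); // the negative size behind the exception
        System.out.println(safeGrow(length, 16) == Integer.MAX_VALUE);
    }
}
```

Clamping only delays the wall, of course: the JVM cannot allocate a byte array larger than roughly {{Integer.MAX_VALUE}} elements, so rows beyond 2G would still need a different buffer representation.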
[jira] [Assigned] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12098: Assignee: (was: Apache Spark) > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > The classic cross-validation requires all inner classifiers to iterate to a > fixed number of iterations, or until convergence. It is costly, > especially in the massive data scenario. According to the paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce > the total number of iterations of cross-validation with multi-armed bandit > search. > The multi-armed bandit search for cross-validation (bandit search for short) > requires warm-start of ML algorithms, and fine-grained control of the inner > behavior of the cross validator. > Since there are a bunch of bandit search algorithms for finding the best > parameter set, we intend to provide only a few of them in the beginning to > reduce the test/perf-test work and make it more stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
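The kind of bandit search the description points at can be sketched as successive halving, the scheme the cited paper builds on. This is a simplified assumption of the setting: each "arm" (hyperparameter candidate) exposes its validation loss after n iterations, and half the arms are dropped each round, so the total iteration budget stays far below running every arm to convergence. The names and loss functions are illustrative only:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

// Hedged sketch of successive halving: repeatedly evaluate all surviving
// arms at the current iteration budget, drop the worse half, and double
// the budget for the survivors.
public class SuccessiveHalving {
    static String search(Map<String, IntToDoubleFunction> arms, int rounds, int baseIters) {
        List<Map.Entry<String, IntToDoubleFunction>> remaining = new ArrayList<>(arms.entrySet());
        int iters = baseIters;
        for (int r = 0; r < rounds && remaining.size() > 1; r++) {
            final int n = iters;
            // Evaluate each arm's loss at the current budget, keep the best half.
            remaining.sort(Comparator.comparingDouble(
                    (Map.Entry<String, IntToDoubleFunction> e) -> e.getValue().applyAsDouble(n)));
            remaining = new ArrayList<>(remaining.subList(0, Math.max(1, remaining.size() / 2)));
            iters *= 2;  // survivors get a doubled iteration budget
        }
        return remaining.get(0).getKey();
    }
}
```

Warm-start matters here exactly as the description says: evaluating an arm at a doubled budget should resume its previous state rather than retrain from scratch.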
[jira] [Commented] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036231#comment-15036231 ] Apache Spark commented on SPARK-12098: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/10105 > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > The classic cross-validation requires all inner classifiers to iterate to a > fixed number of iterations, or until convergence. It is costly, > especially in the massive data scenario. According to the paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce > the total number of iterations of cross-validation with multi-armed bandit > search. > The multi-armed bandit search for cross-validation (bandit search for short) > requires warm-start of ML algorithms, and fine-grained control of the inner > behavior of the cross validator. > Since there are a bunch of bandit search algorithms for finding the best > parameter set, we intend to provide only a few of them in the beginning to > reduce the test/perf-test work and make it more stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12098: Assignee: Apache Spark > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin >Assignee: Apache Spark > > The classic cross-validation requires all inner classifiers to iterate to a > fixed number of iterations, or until convergence. It is costly, > especially in the massive data scenario. According to the paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf), we can see a promising way to reduce > the total number of iterations of cross-validation with multi-armed bandit > search. > The multi-armed bandit search for cross-validation (bandit search for short) > requires warm-start of ML algorithms, and fine-grained control of the inner > behavior of the cross validator. > Since there are a bunch of bandit search algorithms for finding the best > parameter set, we intend to provide only a few of them in the beginning to > reduce the test/perf-test work and make it more stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11801) Notify driver when OOM is thrown before executor JVM is killed
[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036302#comment-15036302 ] Imran Rashid commented on SPARK-11801: -- To summarize, it seems we agree that: 1) we want to keep {{OnOutOfMemoryError}} in YARN mode. 2) we can't guarantee anything making it back to the driver at all on OOM. I *think* there is agreement that: 3) the current situation can lead to misleading msgs, which can be improved. And we are not totally agreed on: 4) should other cluster managers use {{OnOutOfMemoryError}}? But this is independent; I've opened SPARK-12099 to deal with that, and I think we can ignore it here. 5) *how* exactly should we improve the msgs? So the main thing to discuss is #5. I think there are a few options: (a) given that we can't guarantee anything useful gets back to the driver, let's try to keep things consistent and always have the driver just receive a generic "executor lost" type of message. We could still try to improve the logs on the executor somewhat with the shutdown reprioritization. (b) make a best effort at getting a better error msg to the driver, and improve the logs on the executor. I think this would include all of (i) handle OOM specially in {{TaskRunner}}, sending a msg back to the driver immediately (ii) kill the running tasks, in a handler that is higher priority than the disk cleanup (iii) have {{YarnAllocator}} deal with {{SparkExitCode.OOM}}. There are probably more details to discuss on (b), and I am not sure that proposal is 100% correct, but maybe to start, can we discuss (a) vs (b)? [~mrid...@yahoo-inc.com] you voiced a preference for (a) -- but do you feel very strongly about that? I have a strong preference for (b), since I think we can do decently in most cases, and it would really help a lot of users out. 
> Notify driver when OOM is thrown before executor JVM is killed > --- > > Key: SPARK-11801 > URL: https://issues.apache.org/jira/browse/SPARK-11801 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Priority: Minor > > Here is some background for the issue. > Customer got an OOM exception in one of the tasks and the executor got killed with > kill %p. It is unclear from the driver logs/Spark UI why the task or > executor is lost. Customer has to look into the executor logs to see that OOM is > the cause of the task/executor loss. > It would be helpful if the driver logs/Spark UI showed the reason for task > failures by making sure that the task updates the driver with the OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
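Option (b)(i) from the discussion above can be sketched in isolation. The names here are illustrative, not Spark's actual TaskRunner API, and as point 2 of the summary says, the report can never be guaranteed to arrive:

```java
import java.util.function.Consumer;

// Hedged sketch: catch OutOfMemoryError in the task runner, make a
// best-effort report toward the driver, then rethrow so JVM-level handling
// (e.g. the OnOutOfMemoryError kill) still fires.
public class OomReportSketch {
    static void runTask(Consumer<String> reportToDriver, Runnable body) {
        try {
            body.run();
        } catch (OutOfMemoryError oom) {
            try {
                // Best effort only: swallow any failure while reporting,
                // since the heap may be too exhausted to build a message.
                reportToDriver.accept("task failed: " + oom.getClass().getName());
            } catch (Throwable ignored) { }
            throw oom;  // rethrow so the normal executor-death path proceeds
        }
    }
}
```

Even constructing the report string can itself OOM, which is why the reporting is wrapped and why a generic "executor lost" fallback at the driver (option (a)) remains necessary either way.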
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036327#comment-15036327 ] Erik Selin commented on SPARK-12089: It's a bunch of table joins followed by a group by on multiple fields, one of the fields being a month window. Example: {code} select x.a, x.b, x.c, x.month from ( select a, b, c, concat(year(from_unixtime(t)), "-", month(from_unixtime(t))) AS month from foo left join bar on foo.bar_id = bar.id left join biz on foo.biz_id = biz.id ) as x group by x.a, x.b, x.c, x.month {code} My hypothesis, though I need your expertise to confirm or deny it since I'm really not familiar with this area of Spark's codebase, is that the monthly group by is indeed creating something huge, perhaps by bucketing a lot of data together into the same buffer. I have similar jobs running very similar queries but grouped by day, and they are not running into this issue. I'll do a debug log run once I have some spare cycles on my end! :) > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical > > When running a large spark sql query including multiple joins I see tasks > failing with the following trace: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288) > at > 
org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > From the Spark code it looks like this is due to an integer overflow when > growing a buffer length. The offending line {{BufferHolder.java:36}} is the > following in the version I'm running: > {code} > final byte[] tmp = new byte[length * 2]; > {code} > This seems to indicate to me that this buffer will never be able to hold more > than 2G worth of data. And it will likely hold even less, since any length > > 1073741824 will cause an integer overflow and turn the new buffer size > negative. > I hope I'm simply missing some critical config setting, but it still seems > weird that we have a (rather low) upper limit on these buffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
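The bucketing hypothesis in the comment above can be illustrated with a small stand-in for the SQL (plain Java instead of Spark, names illustrative): keying daily records by month concentrates roughly 30x more rows per group than keying by day, and therefore far more growth pressure on any per-group buffer.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// 90 daily records grouped by a "year-month" key fall into 3 groups of up
// to 31 rows, while grouping by day gives 90 groups of 1 row each.
public class MonthBucketing {
    static String monthKey(long epochSeconds) {
        LocalDate d = Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate();
        return d.getYear() + "-" + d.getMonthValue();   // e.g. "1970-1"
    }

    static Map<String, List<Long>> byMonth(List<Long> timestamps) {
        return timestamps.stream().collect(Collectors.groupingBy(MonthBucketing::monthKey));
    }

    public static void main(String[] args) {
        // One record per day for 90 days, as epoch seconds.
        List<Long> ts = LongStream.range(0, 90).map(d -> d * 86400L)
                                  .boxed().collect(Collectors.toList());
        System.out.println(byMonth(ts).size());  // 3 months cover those 90 days
    }
}
```

This would be consistent with the reporter's observation that the day-grouped variants of the same jobs never hit the overflow.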
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036341#comment-15036341 ] Thomas Graves commented on SPARK-1239: -- I have another user hitting this also. The above mentions other issues that need to be addressed in MapOutputStatusTracker. Do you have links to those other issues? > Don't fetch all map output statuses at each reducer during shuffles > --- > > Key: SPARK-1239 > URL: https://issues.apache.org/jira/browse/SPARK-1239 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Patrick Wendell > > Instead we should modify the way we fetch map output statuses to take both a > mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036418#comment-15036418 ] Christoph Pirkl commented on SPARK-10969: - This issue is about adding additional (optional) credentials for DynamoDB and CloudWatch. In my opinion it would be better to introduce a parameter object containing all credentials than to have four additional String parameters. I don't think it is possible to use the same mechanism as in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L98 (i.e. reading the credentials from System.getenv()), as this would mean defining environment variables on each node. A generic configuration using a map would be a potential error source when you mistype a parameter name. > Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis > and DynamoDB > --- > > Key: SPARK-10969 > URL: https://issues.apache.org/jira/browse/SPARK-10969 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.1 >Reporter: Christoph Pirkl >Priority: Critical > > {{KinesisUtils.createStream()}} allows specifying only one set of AWS > credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB > and CloudWatch. > h5. Motivation > In a scenario where one needs to read from a Kinesis Stream owned by a > different AWS account, the user usually has minimal rights (i.e. only read > from the stream). In this case creating the DynamoDB table in KCL will fail. > h5. Proposal > My proposed solution would be to allow specifying multiple credentials in > {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The > additional credentials could then be passed to the constructor of > {{KinesisClientLibConfiguration}} or the method > {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
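The parameter-object proposal from the comment above could look roughly like this. All names are illustrative, not an actual KinesisUtils or KCL API; the idea is simply that services without explicit credentials fall back to the Kinesis ones, preserving today's single-credential behavior:

```java
import java.util.Optional;

// Hedged sketch of a credentials parameter object: one argument to
// createStream instead of four more String parameters.
public class KinesisCredentialsSketch {
    static final class AwsCredentials {
        final String accessKeyId;
        final String secretKey;
        AwsCredentials(String accessKeyId, String secretKey) {
            this.accessKeyId = accessKeyId;
            this.secretKey = secretKey;
        }
    }

    static final class KinesisStreamCredentials {
        final AwsCredentials kinesis;
        final Optional<AwsCredentials> dynamoDb;
        final Optional<AwsCredentials> cloudWatch;
        KinesisStreamCredentials(AwsCredentials kinesis,
                                 Optional<AwsCredentials> dynamoDb,
                                 Optional<AwsCredentials> cloudWatch) {
            this.kinesis = kinesis;
            this.dynamoDb = dynamoDb;
            this.cloudWatch = cloudWatch;
        }
        // Fall back to the Kinesis credentials when no override is given.
        AwsCredentials dynamoDbOrDefault() { return dynamoDb.orElse(kinesis); }
        AwsCredentials cloudWatchOrDefault() { return cloudWatch.orElse(kinesis); }
    }
}
```

Typed fields avoid the mistyped-key problem of a generic map while keeping the overloads of createStream from multiplying.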
[jira] [Updated] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib
[ https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-11219: - Description: There are several different formats for describing params in PySpark.MLlib, making it unclear what the preferred way to document is, i.e. vertical alignment vs single line. This is to agree on a format and make it consistent across PySpark.MLlib. Following the discussion in SPARK-10560, using 2 lines with an indentation is both readable and doesn't lead to changing many lines when adding/removing parameters. If the parameter uses a default value, put this in parentheses on a new line under the description. Example: {noformat} :param stepSize: Step size for each iteration of gradient descent. (default: 0.1) :param numIterations: Number of iterations run for each batch of data. (default: 50) {noformat} h2. Current State of Parameter Description Formatting h4. Classification * LogisticRegressionModel - single line descriptions, fix indentations * LogisticRegressionWithSGD - vertical alignment, sporadic default values * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values * SVMModel - single line * SVMWithSGD - vertical alignment, sporadic default values * NaiveBayesModel - single line * NaiveBayes - single line h4. Clustering * KMeansModel - missing param description * KMeans - missing param description and defaults * GaussianMixture - vertical align, incorrect default formatting * PowerIterationClustering - single line with wrapped indentation, missing defaults * StreamingKMeansModel - single line wrapped * StreamingKMeans - single line wrapped, missing defaults * LDAModel - single line * LDA - vertical align, missing some defaults h4. FPM * FPGrowth - single line * PrefixSpan - single line, default values in backticks h4. Recommendation * ALS - does not have param descriptions h4. 
Regression * LabeledPoint - single line * LinearModel - single line * LinearRegressionWithSGD - vertical alignment * RidgeRegressionWithSGD - vertical align * IsotonicRegressionModel - single line * IsotonicRegression - single line, missing default h4. Tree * DecisionTree - single line with vertical indentation, missing defaults * RandomForest - single line with wrapped indent, missing some defaults * GradientBoostedTrees - single line with wrapped indent NOTE: This issue will just focus on model/algorithm descriptions, which are the largest source of inconsistent formatting. evaluation.py, feature.py, random.py, utils.py - these supporting classes have param descriptions as single lines, but they are consistent, so they don't need to be changed. was: There are several different formats for describing params in PySpark.MLlib, making it unclear what the preferred way to document is, i.e. vertical alignment vs single line. This is to agree on a format and make it consistent across PySpark.MLlib. Following the discussion in SPARK-10560, using 2 lines with an indentation is both readable and doesn't lead to changing many lines when adding/removing parameters. If the parameter uses a default value, put this in parenthesis in a new line under the description. Example: {noformat} :param stepSize: Step size for each iteration of gradient descent. (default: 0.1) :param numIterations: Number of iterations run for each batch of data. (default: 50) {noformat} > Make Parameter Description Format Consistent in PySpark.MLlib > - > > Key: SPARK-11219 > URL: https://issues.apache.org/jira/browse/SPARK-11219 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > There are several different formats for describing params in PySpark.MLlib, > making it unclear what the preferred way to document is, i.e. vertical > alignment vs single line. > This is to agree on a format and make it consistent across PySpark.MLlib. 
> Following the discussion in SPARK-10560, using 2 lines with an indentation is > both readable and doesn't lead to changing many lines when adding/removing > parameters. If the parameter uses a default value, put this in parentheses > on a new line under the description. > Example: > {noformat} > :param stepSize: > Step size for each iteration of gradient descent. > (default: 0.1) > :param numIterations: > Number of iterations run for each batch of data. > (default: 50) > {noformat} > h2. Current State of Parameter Description Formatting > h4. Classification > * LogisticRegressionModel - single line descriptions, fix indentations > * LogisticRegressionWithSGD - vertical alignment, sporadic default values > * LogisticRegressionWithLBFGS - vertical alignment, sporadic default
[jira] [Commented] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036487#comment-15036487 ] Sean Owen commented on SPARK-12101: --- To cut down the noise, if this is logically the same issue as one you recently fixed, they should be attached to one JIRA. That is, someone who gets your first fix wants the second one; they aren't really separable. > Fix thread pools that cannot cache tasks in Worker and AppClient > > > Key: SPARK-12101 > URL: https://issues.apache.org/jira/browse/SPARK-12101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. > It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
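The mechanics behind the one-line description above can be sketched outside Spark. This is a hedged, simplified reconstruction (an assumption modeled on what a daemon "cached" pool amounts to, not Spark's exact ThreadUtils source): a SynchronousQueue has zero capacity, so a pool with a fixed maximum thread count rejects submissions once every thread is busy, while a cached pool keeps the handoff queue but spawns a new thread instead of rejecting the task.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DaemonCachedPoolSketch {
    static ThreadPoolExecutor newDaemonCachedThreadPool(final String prefix) {
        final AtomicInteger counter = new AtomicInteger(0);
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);  // don't keep the JVM alive on shutdown
            return t;
        };
        // core 0, max unbounded: a busy pool grows rather than rejects,
        // and idle threads die after 60s.
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>(), factory);
    }
}
```

This is the same queueing strategy `Executors.newCachedThreadPool` uses; the bug was pairing the zero-capacity queue with a bounded thread count.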
[jira] [Updated] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-10873: -- Assignee: Zhuo Liu > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Thomas Graves >Assignee: Zhuo Liu > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration
[ https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036375#comment-15036375 ] Apache Spark commented on SPARK-11155: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/10107 > Stage summary json should include stage duration > - > > Key: SPARK-11155 > URL: https://issues.apache.org/jira/browse/SPARK-11155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Imran Rashid >Assignee: Xin Ren >Priority: Minor > Labels: Starter > > The json endpoint for stages doesn't include information on the stage > duration that is present in the UI. This looks like a simple oversight; they > should be included. E.g., the metrics should be included at > {{api/v1/applications//stages}}. The missing metrics are > {{submissionTime}} and {{completionTime}} (and whatever other metrics come > out of the discussion on SPARK-10930) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Description: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data produced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. The left key is always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( i => KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER)) val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE was: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data procuced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. 
The left-key always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( i => KafkaUtils.createStream[STring, STring, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER(( val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Dan Dutrow > Fix For: 1.0.1 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to pool resources. I have 10+ topics and don't want to > dedicate 10 cores to processing all of them. However, when reading the data > produced by KafkaUtils.createStream, the DStream[(String,String)] does not > properly insert the topic name into the tuple. The left key is always null, > making it impossible to know what topic that data came from other than > stashing your key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
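The usual workaround for the null key can be sketched with plain collections standing in for DStreams (names illustrative): create one stream per topic and tag each record with its topic name yourself before unioning, rather than relying on the tuple's key. In Spark this would be one createStream(...) per topic followed by a map that pairs the topic name with each value, then ssc.union(...).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch: per-topic sources are tagged explicitly, so the topic name
// travels with every record and the receiver's null key no longer matters.
public class TopicTagSketch {
    static List<Map.Entry<String, String>> tagAndUnion(Map<String, List<String>> perTopic) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> topic : perTopic.entrySet()) {
            for (String value : topic.getValue()) {
                out.add(new AbstractMap.SimpleEntry<>(topic.getKey(), value));
            }
        }
        return out;
    }
}
```

The trade-off is one receiver per topic (more cores) unless receivers are themselves shared across topic groups, which is exactly the pooling tension the description raises.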
[jira] [Commented] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong
[ https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036395#comment-15036395 ] Thomas Graves commented on SPARK-11701: --- Also seems related to https://github.com/apache/spark/pull/9288 > YARN - dynamic allocation and speculation active task accounting wrong > -- > > Key: SPARK-11701 > URL: https://issues.apache.org/jira/browse/SPARK-11701 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > I am using dynamic container allocation and speculation and am seeing issues > with the active task accounting. The Executor UI still shows active tasks on > the executor but the job/stage is all completed. I think it's also > affecting the dynamic allocation being able to release containers because it > thinks there are still tasks. > It's easily reproduced by using spark-shell: turn on dynamic allocation, then > run just a wordcount on a decent sized file and set the speculation parameters > low: > spark.dynamicAllocation.enabled true > spark.shuffle.service.enabled true > spark.dynamicAllocation.maxExecutors 10 > spark.dynamicAllocation.minExecutors 2 > spark.dynamicAllocation.initialExecutors 10 > spark.dynamicAllocation.executorIdleTimeout 40s > $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf > spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 > --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11155) Stage summary json should include stage duration
[ https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-11155: Attachment: Screen Shot 2015-12-02.png Sorry for being so slow... this is my first code change, and getting familiar with Spark terminology and structure took me quite a while. I've submitted the PR; please let me know if any changes are needed. Thank you very much [~imranr] for your detailed guide for the code change, it really helped me a lot! :) > Stage summary json should include stage duration > - > > Key: SPARK-11155 > URL: https://issues.apache.org/jira/browse/SPARK-11155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Imran Rashid >Assignee: Xin Ren >Priority: Minor > Labels: Starter > Attachments: Screen Shot 2015-12-02.png > > > The json endpoint for stages doesn't include information on the stage > duration that is present in the UI. This looks like a simple oversight; they > should be included. E.g., the metrics should be included at > {{api/v1/applications//stages}}. The missing metrics are > {{submissionTime}} and {{completionTime}} (and whatever other metrics come > out of the discussion on SPARK-10930) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12101: Assignee: Apache Spark (was: Shixiong Zhu) > Fix thread pools that cannot cache tasks in Worker and AppClient > > > Key: SPARK-12101 > URL: https://issues.apache.org/jira/browse/SPARK-12101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. > It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Fix Version/s: (was: 1.0.1) 1.4.2 > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1 >Reporter: Dan Dutrow > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to pool resources. I have 10+ topics and don't want to > dedicate 10 cores to processing all of them. However, when reading the data > produced by KafkaUtils.createStream, the DStream[(String,String)] does not > properly insert the topic name into the tuple. The left key is always null, > making it impossible to know what topic that data came from other than > stashing your key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
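Until the receiver fills in the key, the usual workaround is the one the report hints at: create one stream per topic and tag each record with its topic before unioning, instead of one stream for all topics. A standalone sketch of the tagging idea in plain Java (plain collections stand in for DStreams; the topic names and data are made up for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopicTagDemo {
    public static void main(String[] args) {
        // Hypothetical per-topic record sources; in Spark this would be one
        // KafkaUtils.createStream(...) per topic instead of one shared stream.
        Map<String, List<String>> perTopic = Map.of(
                "topicA", List.of("a1", "a2"),
                "topicB", List.of("b1"));

        // Tag each record with its topic BEFORE unioning, so the information
        // the receiver drops from the key survives the union.
        List<Map.Entry<String, String>> unioned = perTopic.entrySet().stream()
                .flatMap(e -> e.getValue().stream()
                        .map(v -> Map.entry(e.getKey(), v)))
                .collect(Collectors.toList());

        long fromTopicA = unioned.stream()
                .filter(kv -> kv.getKey().equals("topicA"))
                .count();
        System.out.println("records tagged topicA: " + fromTopicA);
    }
}
```

The cost of this pattern in Spark is the one the reporter objects to: each receiver stream occupies a core, so per-topic streams do not pool resources.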
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Affects Version/s: 1.1.0 1.2.0 1.3.0 1.4.0 1.4.1 > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1 >Reporter: Dan Dutrow > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to pool resources. I have 10+ topics and don't want to > dedicate 10 cores to processing all of them. However, when reading the data > produced by KafkaUtils.createStream, the DStream[(String,String)] does not > properly insert the topic name into the tuple. The left key is always null, > making it impossible to know what topic that data came from other than > stashing your key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient
Shixiong Zhu created SPARK-12101: Summary: Fix thread pools that cannot cache tasks in Worker and AppClient Key: SPARK-12101 URL: https://issues.apache.org/jira/browse/SPARK-12101 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2, 1.4.1, 1.6.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12102) In check analysis, data type check always
Yin Huai created SPARK-12102: Summary: In check analysis, data type check always Key: SPARK-12102 URL: https://issues.apache.org/jira/browse/SPARK-12102 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > 0) THEN struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type; line 1 pos 85}}. The problem is the nullability difference between {{4}} (non-nullable) and {{hash(4)}} (nullable). Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Summary: Cast a non-nullable struct field to a nullable field during analysis (was: In check analysis, data type check always ) > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
[ https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12000: - Target Version/s: 1.7.0 (was: 1.6.0) > `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation > - > > Key: SPARK-12000 > URL: https://issues.apache.org/jira/browse/SPARK-12000 > Project: Spark > Issue Type: Bug > Components: Build, Documentation, MLlib >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Josh Rosen >Priority: Blocker > > Reported by [~josephkb]. Not sure what is the root cause, but this is the > error message when I ran "sbt publishLocal": > {code} > [error] (launcher/compile:doc) javadoc returned nonzero exit code > [error] (mllib/compile:doc) scala.reflect.internal.FatalError: > [error] while compiling: > /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala > [error] during phase: global=terminal, atPhase=parser > [error] library version: version 2.10.5 > [error] compiler version: version 2.10.5 > [error] reconstructed args: -Yno-self-type-checks -groups -classpath >
[jira] [Commented] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong
[ https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036366#comment-15036366 ] Thomas Graves commented on SPARK-11701: --- this looks like a dup of SPARK-9038 > YARN - dynamic allocation and speculation active task accounting wrong > -- > > Key: SPARK-11701 > URL: https://issues.apache.org/jira/browse/SPARK-11701 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > I am using dynamic container allocation and speculation and am seeing issues > with the active task accounting. The Executor UI still shows active tasks on > an executor, but the job/stage is all completed. I think it's also > affecting dynamic allocation's ability to release containers, because it > thinks there are still tasks. > It's easily reproduced by using spark-shell: turn on dynamic allocation, then > run just a wordcount on a decent-sized file and set the speculation parameters > low: > spark.dynamicAllocation.enabled true > spark.shuffle.service.enabled true > spark.dynamicAllocation.maxExecutors 10 > spark.dynamicAllocation.minExecutors 2 > spark.dynamicAllocation.initialExecutors 10 > spark.dynamicAllocation.executorIdleTimeout 40s > $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf > spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 > --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12101: -- Priority: Minor (was: Major) > Fix thread pools that cannot cache tasks in Worker and AppClient > > > Key: SPARK-12101 > URL: https://issues.apache.org/jira/browse/SPARK-12101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. > It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12101: Assignee: Shixiong Zhu (was: Apache Spark) > Fix thread pools that cannot cache tasks in Worker and AppClient > > > Key: SPARK-12101 > URL: https://issues.apache.org/jira/browse/SPARK-12101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. > It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12101) Fix thread pools that cannot cache tasks in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-12101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036516#comment-15036516 ] Apache Spark commented on SPARK-12101: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10108 > Fix thread pools that cannot cache tasks in Worker and AppClient > > > Key: SPARK-12101 > URL: https://issues.apache.org/jira/browse/SPARK-12101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > SynchronousQueue cannot cache any task. This issue is similar to SPARK-11999. > It's an easy fix. Just use the fixed ThreadUtils.newDaemonCachedThreadPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Description: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data produced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. The left key is always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( i => KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER)) val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE was: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data produced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. The left key is always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( i => KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER(( val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Dan Dutrow > Fix For: 1.0.1 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to pool resources. I have 10+ topics and don't want to > dedicate 10 cores to processing all of them. However, when reading the data > produced by KafkaUtils.createStream, the DStream[(String,String)] does not > properly insert the topic name into the tuple. The left key is always null, > making it impossible to know what topic that data came from other than > stashing your key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib
[ https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036470#comment-15036470 ] Bryan Cutler commented on SPARK-11219: -- I added an assessment of the current state of param descriptions for algorithms/models in pyspark.mllib. To keep changes well separated, I will make sub-tasks for each Python file, except for FPM and Recommendation which are small and can probably be combined. > Make Parameter Description Format Consistent in PySpark.MLlib > - > > Key: SPARK-11219 > URL: https://issues.apache.org/jira/browse/SPARK-11219 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > There are several different formats for describing params in PySpark.MLlib, > making it unclear what the preferred way to document is, i.e. vertical > alignment vs single line. > This is to agree on a format and make it consistent across PySpark.MLlib. > Following the discussion in SPARK-10560, using 2 lines with an indentation is > both readable and doesn't lead to changing many lines when adding/removing > parameters. If the parameter uses a default value, put this in parentheses > on a new line under the description. > Example: > {noformat} > :param stepSize: > Step size for each iteration of gradient descent. > (default: 0.1) > :param numIterations: > Number of iterations run for each batch of data. > (default: 50) > {noformat} > h2. Current State of Parameter Description Formatting > h4. Classification > * LogisticRegressionModel - single line descriptions, fix indentations > * LogisticRegressionWithSGD - vertical alignment, sporadic default values > * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values > * SVMModel - single line > * SVMWithSGD - vertical alignment, sporadic default values > * NaiveBayesModel - single line > * NaiveBayes - single line > h4. Clustering > * KMeansModel - missing param description > * KMeans - missing param description and defaults > * GaussianMixture - vertical align, incorrect default formatting > * PowerIterationClustering - single line with wrapped indentation, missing > defaults > * StreamingKMeansModel - single line wrapped > * StreamingKMeans - single line wrapped, missing defaults > * LDAModel - single line > * LDA - vertical align, missing some defaults > h4. FPM > * FPGrowth - single line > * PrefixSpan - single line, default values in backticks > h4. Recommendation > * ALS - does not have param descriptions > h4. Regression > * LabeledPoint - single line > * LinearModel - single line > * LinearRegressionWithSGD - vertical alignment > * RidgeRegressionWithSGD - vertical align > * IsotonicRegressionModel - single line > * IsotonicRegression - single line, missing default > h4. Tree > * DecisionTree - single line with vertical indentation, missing defaults > * RandomForest - single line with wrapped indent, missing some defaults > * GradientBoostedTrees - single line with wrapped indent > NOTE > This issue will just focus on model/algorithm descriptions, which are the > largest source of inconsistent formatting > evaluation.py, feature.py, random.py, utils.py - these supporting classes > have param descriptions as single line, but are consistent so don't need to > be changed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
Dan Dutrow created SPARK-12103: -- Summary: KafkaUtils createStream with multiple topics -- does not work as expected Key: SPARK-12103 URL: https://issues.apache.org/jira/browse/SPARK-12103 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Dan Dutrow Fix For: 1.0.1 The default way of creating a stream out of a Kafka source would be val stream = KafkaUtils.createStream(ssc,"localhost:2181","logs", Map("retarget" -> 2,"datapair" -> 2)) However, if two topics - in this case "retarget" and "datapair" - are very different, there is no way to set up different filters, mapping functions, etc., as they are effectively merged. However, the instance of KafkaInputDStream created with this call internally calls ConsumerConnector.createMessageStream(), which returns a *map* of KafkaStreams, keyed by topic. It would be great if this map were exposed somehow, so the aforementioned call val streamS = KafkaUtils.createStreamS(...) returned a map of streams. Regards, Sergey Malov Collective Media -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Description: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data produced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. The left key is always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( i => KafkaUtils.createStream[STring, STring, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER(( val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE was: The default way of creating a stream out of a Kafka source would be val stream = KafkaUtils.createStream(ssc,"localhost:2181","logs", Map("retarget" -> 2,"datapair" -> 2)) However, if two topics - in this case "retarget" and "datapair" - are very different, there is no way to set up different filters, mapping functions, etc., as they are effectively merged. However, the instance of KafkaInputDStream created with this call internally calls ConsumerConnector.createMessageStream(), which returns a *map* of KafkaStreams, keyed by topic. It would be great if this map were exposed somehow, so the aforementioned call val streamS = KafkaUtils.createStreamS(...) returned a map of streams. Regards, Sergey Malov Collective Media > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Dan Dutrow > Fix For: 1.0.1 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to pool resources. I have 10+ topics and don't want to > dedicate 10 cores to processing all of them. However, when reading the data > produced by KafkaUtils.createStream, the DStream[(String,String)] does not > properly insert the topic name into the tuple. The left key is always null, > making it impossible to know what topic that data came from other than > stashing your key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[STring, STring, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER(( > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Target Version/s: 1.4.2, 1.6.1 (was: 1.0.1) > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1 >Reporter: Dan Dutrow > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to pool resources. I have 10+ topics and don't want to > dedicate 10 cores to processing all of them. However, when reading the data > produced by KafkaUtils.createStream, the DStream[(String,String)] does not > properly insert the topic name into the tuple. The left key is always null, > making it impossible to know what topic that data came from other than > stashing your key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close
[ https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035455#comment-15035455 ] Apache Spark commented on SPARK-12088: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/10095 > check connection.isClose before connection.getAutoCommit in JDBCRDD.close > - > > Key: SPARK-12088 > URL: https://issues.apache.org/jira/browse/SPARK-12088 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Huaxin Gao >Priority: Minor > > in JDBCRDD, it has > if (!conn.getAutoCommit && !conn.isClosed) { > try { > conn.commit() > } > . . . . . . > In my test, the connection is already closed, so conn.getAutoCommit throws an > exception. We should check !conn.isClosed before checking !conn.getAutoCommit > to avoid the exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
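The proposed reordering amounts to relying on && short-circuit evaluation. A standalone sketch (FakeConnection is a made-up stand-in for a JDBC connection whose getAutoCommit() fails once closed, as the reporter observed; real drivers throw the checked SQLException rather than the unchecked exception used here to keep the sketch short):

```java
public class CloseOrderDemo {
    // Hypothetical stand-in for a JDBC connection: getAutoCommit() fails
    // once the connection has been closed.
    static class FakeConnection {
        private boolean closed = false;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
        boolean getAutoCommit() {
            if (closed) throw new IllegalStateException("Connection is closed");
            return false;
        }
        void commit() { /* no-op for the sketch */ }
    }

    public static void main(String[] args) {
        FakeConnection conn = new FakeConnection();
        conn.close();   // simulate the connection already being closed

        // Original order: getAutoCommit() is evaluated first and throws.
        boolean threw = false;
        try {
            if (!conn.getAutoCommit() && !conn.isClosed()) conn.commit();
        } catch (IllegalStateException e) {
            threw = true;
        }
        System.out.println("old order threw: " + threw);

        // Proposed order: !isClosed() is false, so && short-circuits and
        // getAutoCommit() is never called on the closed connection.
        if (!conn.isClosed() && !conn.getAutoCommit()) conn.commit();
        System.out.println("new order threw: false");
    }
}
```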
[jira] [Assigned] (SPARK-12093) Fix the error of comment in DDLParser
[ https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12093: Assignee: Apache Spark > Fix the error of comment in DDLParser > - > > Key: SPARK-12093 > URL: https://issues.apache.org/jira/browse/SPARK-12093 > Project: Spark > Issue Type: Documentation >Reporter: Yadong Qi >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12093) Fix the error of comment in DDLParser
[ https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035462#comment-15035462 ] Apache Spark commented on SPARK-12093: -- User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/10096 > Fix the error of comment in DDLParser > - > > Key: SPARK-12093 > URL: https://issues.apache.org/jira/browse/SPARK-12093 > Project: Spark > Issue Type: Documentation >Reporter: Yadong Qi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12093) Fix the error of comment in DDLParser
[ https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12093: Assignee: (was: Apache Spark) > Fix the error of comment in DDLParser > - > > Key: SPARK-12093 > URL: https://issues.apache.org/jira/browse/SPARK-12093 > Project: Spark > Issue Type: Documentation >Reporter: Yadong Qi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11964) Create user guide section explaining export/import
[ https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035480#comment-15035480 ] Bill Chambers commented on SPARK-11964: --- quick question, am I to assume that all pieces mentioned in this jira: https://issues.apache.org/jira/browse/SPARK-6725 are to be included in the new release [and the user guide]? > Create user guide section explaining export/import > -- > > Key: SPARK-11964 > URL: https://issues.apache.org/jira/browse/SPARK-11964 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > I'm envisioning a single section in the main guide explaining how it works > with an example and noting major missing coverage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12048) JDBCRDD calls close() twice - SQLite then throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035486#comment-15035486 ] R. H. commented on SPARK-12048: --- I have added the missing line, built a distribution, and tested it successfully: SQLite no longer throws the exception (and a test against PostgreSQL was OK as well). The PR is in https://github.com/rh99/spark - What else can I do with the PR? > JDBCRDD calls close() twice - SQLite then throws an exception > - > > Key: SPARK-12048 > URL: https://issues.apache.org/jira/browse/SPARK-12048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: R. H. >Priority: Minor > Labels: starter > > The following code works: > {quote} > val tableData = sqlContext.read.format("jdbc") > .options( > Map( > "url" -> "jdbc:sqlite:/tmp/test.db", > "dbtable" -> "testtable")).load() > {quote} > but an exception gets reported. From the log: > {quote} > 15/11/30 12:13:02 INFO jdbc.JDBCRDD: closed connection > 15/11/30 12:13:02 WARN jdbc.JDBCRDD: Exception closing statement > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database > (Connection is closed) at org.sqlite.core.DB.newSQLException(DB.java:890) at > org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109) at > org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35) at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454) > {quote} > So Spark succeeded in closing the JDBC connection, but then fails to close > the JDBC statement.
> Looking at the source, close() seems to be called twice: > {quote} > context.addTaskCompletionListener\{ context => close() \} > {quote} > and in > {quote} > def hasNext > {quote} > If you look at the close() method (around line 443) > {quote} > def close() { > if (closed) return > {quote} > you can see that it checks the variable closed, but that value is never set > to true. > So a trivial fix should be to set "closed = true" at the end of close().
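The one-line fix proposed above (setting closed = true at the end of close()) can be sketched language-neutrally. The class below is purely illustrative, not Spark's actual JDBCRDD code; it just shows why the guard flag makes the second close() call a harmless no-op:

```python
class CursorLike:
    """Illustrative stand-in for JDBCRDD's row iterator. close() can be
    invoked twice (once by the task-completion listener, once from hasNext),
    so the `closed` flag must actually be set for the guard to work."""

    def __init__(self):
        self.closed = False
        self.cleanup_runs = 0  # how many times resources were really released

    def close(self):
        if self.closed:          # guard: already closed, return immediately
            return
        self.cleanup_runs += 1   # release statement/connection here
        self.closed = True       # the missing line: mark ourselves closed

it = CursorLike()
it.close()
it.close()  # duplicate call, e.g. from the task-completion listener
```

Without the final assignment, both calls would run the cleanup body; with it, only the first does.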
[jira] [Commented] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035496#comment-15035496 ] Apache Spark commented on SPARK-12067: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10097 > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > * SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced > by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to > Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to > Column.isnotnull. > * Add Column.notnull as alias of Column.isnotnull following the pandas naming > convention. > * Add DataFrame.isnotnull and DataFrame.notnull. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12067: Summary: Fix usage of isnan, isnull, isnotnull of Column (was: Fix usage of isnan, isnull, isnotnull of Column and DataFrame) > Fix usage of isnan, isnull, isnotnull of Column > --- > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced > by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to > Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to > Column.isnotnull.- > * -Add Column.notnull as alias of Column.isnotnull following the pandas > naming convention.- > * -Add DataFrame.isnotnull and DataFrame.notnull.- > * Add Column.isNaN to PySpark. > * Add isnull, notnull, isnan as alias at Python side in order to be > compatible with pandas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12067: Description: * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull.- * -Add Column.notnull as alias of Column.isnotnull following the pandas naming convention.- * -Add DataFrame.isnotnull and DataFrame.notnull.- * Add Column.isNaN to PySpark. * Add isnull, notnull, isnan as alias at Python side in order to be compatible with pandas. was: * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull.- * -Add Column.notnull as alias of Column.isnotnull following the pandas naming convention.- * -Add DataFrame.isnotnull and DataFrame.notnull.- Add Column.isNaN to PySpark. Add isnull, notnull, isnan as alias at Python side in order to be compatible with pandas. > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced > by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to > Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to > Column.isnotnull.- > * -Add Column.notnull as alias of Column.isnotnull following the pandas > naming convention.- > * -Add DataFrame.isnotnull and DataFrame.notnull.- > * Add Column.isNaN to PySpark. > * Add isnull, notnull, isnan as alias at Python side in order to be > compatible with pandas. 
[jira] [Assigned] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close
[ https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12088: Assignee: Apache Spark > check connection.isClose before connection.getAutoCommit in JDBCRDD.close > - > > Key: SPARK-12088 > URL: https://issues.apache.org/jira/browse/SPARK-12088 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > In JDBCRDD, close() has > if (!conn.getAutoCommit && !conn.isClosed) { > try { > conn.commit() > } > . . . . . . > In my test, the connection is already closed, so conn.getAutoCommit throws an > Exception. We should check !conn.isClosed before checking !conn.getAutoCommit > to avoid the Exception.
[jira] [Assigned] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close
[ https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12088: Assignee: (was: Apache Spark) > check connection.isClose before connection.getAutoCommit in JDBCRDD.close > - > > Key: SPARK-12088 > URL: https://issues.apache.org/jira/browse/SPARK-12088 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Huaxin Gao >Priority: Minor > > In JDBCRDD, close() has > if (!conn.getAutoCommit && !conn.isClosed) { > try { > conn.commit() > } > . . . . . . > In my test, the connection is already closed, so conn.getAutoCommit throws an > Exception. We should check !conn.isClosed before checking !conn.getAutoCommit > to avoid the Exception.
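The reordering SPARK-12088 asks for relies on short-circuit evaluation: if isClosed is tested first, getAutoCommit is never reached on a dead connection. A minimal sketch with a hypothetical stand-in connection (not java.sql.Connection, and not Spark's actual code):

```python
class FakeConnection:
    """Hypothetical connection whose getAutoCommit fails once the
    connection is closed, mirroring the behavior reported in the ticket."""

    def __init__(self, closed):
        self._closed = closed

    def is_closed(self):
        return self._closed

    def get_auto_commit(self):
        if self._closed:
            raise RuntimeError("Connection is closed")
        return False

conn = FakeConnection(closed=True)  # the connection was closed elsewhere

# Original order: getAutoCommit is evaluated first and raises.
try:
    should_commit = not conn.get_auto_commit() and not conn.is_closed()
    buggy_raised = False
except RuntimeError:
    buggy_raised = True

# Proposed order: is_closed short-circuits, so getAutoCommit never runs.
should_commit = not conn.is_closed() and not conn.get_auto_commit()
```

With the checks swapped, the closed connection simply skips the commit instead of throwing.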
[jira] [Created] (SPARK-12093) Fix the error of comment in DDLParser
Yadong Qi created SPARK-12093: - Summary: Fix the error of comment in DDLParser Key: SPARK-12093 URL: https://issues.apache.org/jira/browse/SPARK-12093 Project: Spark Issue Type: Documentation Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11964) Create user guide section explaining export/import
[ https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035480#comment-15035480 ] Bill Chambers edited comment on SPARK-11964 at 12/2/15 8:52 AM: quick question, am I to assume that all pieces mentioned in this jira: https://issues.apache.org/jira/browse/SPARK-6725 are to be included, even those that are unresolved, in the new release [and the user guide]? was (Author: bill_chambers): quick question, am I to assume that all pieces mentioned in this jira: https://issues.apache.org/jira/browse/SPARK-6725 are to be included in the new release [and the user guide]? > Create user guide section explaining export/import > -- > > Key: SPARK-11964 > URL: https://issues.apache.org/jira/browse/SPARK-11964 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > I'm envisioning a single section in the main guide explaining how it works > with an example and noting major missing coverage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12094) Better format for query plan tree string
Cheng Lian created SPARK-12094: -- Summary: Better format for query plan tree string Key: SPARK-12094 URL: https://issues.apache.org/jira/browse/SPARK-12094 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.7.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor When examining query plans of complex queries with multiple joins, a pain point of mine is that it's hard to immediately see the sibling node of a specific query plan node. For example: {noformat} TakeOrderedAndProject ... ConvertToSafe Project ... TungstenAggregate ... TungstenExchange ... TungstenAggregate ... Project ... BroadcastHashJoin ... Project ... BroadcastHashJoin ... Project ... BroadcastHashJoin ... Scan ... Filter ... Scan ... Scan ... Project ... Filter ... Scan ... {noformat} It would be better to have tree lines to indicate relationships between plan nodes.
[jira] [Commented] (SPARK-12048) JDBCRDD calls close() twice - SQLite then throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035521#comment-15035521 ] Sean Owen commented on SPARK-12048: --- You need to read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark [~rh99] > JDBCRDD calls close() twice - SQLite then throws an exception > - > > Key: SPARK-12048 > URL: https://issues.apache.org/jira/browse/SPARK-12048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: R. H. >Priority: Minor > Labels: starter > > The following code works: > {quote} > val tableData = sqlContext.read.format("jdbc") > .options( > Map( > "url" -> "jdbc:sqlite:/tmp/test.db", > "dbtable" -> "testtable")).load() > {quote} > but an exception gets reported. From the log: > {quote} > 15/11/30 12:13:02 INFO jdbc.JDBCRDD: closed connection > 15/11/30 12:13:02 WARN jdbc.JDBCRDD: Exception closing statement > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database > (Connection is closed) at org.sqlite.core.DB.newSQLException(DB.java:890) at > org.sqlite.core.CoreStatement.internalClose(CoreStatement.java:109) at > org.sqlite.jdbc3.JDBC3Statement.close(JDBC3Statement.java:35) at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$anon$$close(JDBCRDD.scala:454) > {quote} > So Spark succeeded to close the JDBC connection, and then it fails to close > the JDBC statement. > Looking at the source, close() seems to be called twice: > {quote} > context.addTaskCompletionListener\{ context => close() \} > {quote} > and in > {quote} > def hasNext > {quote} > If you look at the close() method (around line 443) > {quote} > def close() { > if (closed) return > {quote} > you can see that it checks the variable closed, but that value is never set > to true. > So a trivial fix should be to set "closed = true" at the end of close(). 
[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12067: Description: -* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull. * Add Column.notnull as alias of Column.isnotnull following the pandas naming convention. * Add DataFrame.isnotnull and DataFrame.notnull.- Add Column.isNaN to PySpark. Add isnull, notnull, isnan as alias at Python side in order to be compatible with pandas. was: * SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull. * Add Column.notnull as alias of Column.isnotnull following the pandas naming convention. * Add DataFrame.isnotnull and DataFrame.notnull. > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > -* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced > by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to > Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to > Column.isnotnull. > * Add Column.notnull as alias of Column.isnotnull following the pandas naming > convention. > * Add DataFrame.isnotnull and DataFrame.notnull.- > Add Column.isNaN to PySpark. > Add isnull, notnull, isnan as alias at Python side in order to be compatible > with pandas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12067: Description: * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull.- * -Add Column.notnull as alias of Column.isnotnull following the pandas naming convention.- * -Add DataFrame.isnotnull and DataFrame.notnull.- Add Column.isNaN to PySpark. Add isnull, notnull, isnan as alias at Python side in order to be compatible with pandas. was: -* SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull. * Add Column.notnull as alias of Column.isnotnull following the pandas naming convention. * Add DataFrame.isnotnull and DataFrame.notnull.- Add Column.isNaN to PySpark. Add isnull, notnull, isnan as alias at Python side in order to be compatible with pandas. > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > * -SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced > by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to > Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to > Column.isnotnull.- > * -Add Column.notnull as alias of Column.isnotnull following the pandas > naming convention.- > * -Add DataFrame.isnotnull and DataFrame.notnull.- > Add Column.isNaN to PySpark. > Add isnull, notnull, isnan as alias at Python side in order to be compatible > with pandas. 
[jira] [Updated] (SPARK-12093) Fix the error of comment in DDLParser
[ https://issues.apache.org/jira/browse/SPARK-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12093: -- Priority: Trivial (was: Major) Component/s: Documentation [~waterman] This probably isn't worth a JIRA. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and fill out JIRAs more carefully in the future. This has no component and is not "Major". Most importantly, this JIRA provides no information about the problem. > Fix the error of comment in DDLParser > - > > Key: SPARK-12093 > URL: https://issues.apache.org/jira/browse/SPARK-12093 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Yadong Qi >Priority: Trivial >
[jira] [Updated] (SPARK-12094) Better format for query plan tree string
[ https://issues.apache.org/jira/browse/SPARK-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12094: --- Description: When examine plans of complex queries with multiple joins, a pain point of mine is that, it's hard to immediately see the sibling node of a specific query plan node. For example: {noformat} TakeOrderedAndProject ... ConvertToSafe Project ... TungstenAggregate ... TungstenExchange ... TungstenAggregate ... Project ... BroadcastHashJoin ... Project ... BroadcastHashJoin ... Project ... BroadcastHashJoin ... Scan ... Filter ... Scan ... Scan ... Project ... Filter ... Scan ... {noformat} Would be better to have tree lines to indicate relationships between plan nodes. was: When examine query plans of complex query plans with multiple joins, a pain point of mine is that, it's hard to immediately see the sibling node of a specific query plan node. For example: {noformat} TakeOrderedAndProject ... ConvertToSafe Project ... TungstenAggregate ... TungstenExchange ... TungstenAggregate ... Project ... BroadcastHashJoin ... Project ... BroadcastHashJoin ... Project ... BroadcastHashJoin ... Scan ... Filter ... Scan ... Scan ... Project ... Filter ... Scan ... {noformat} Would be better to have tree lines to indicate relationships between plan nodes. > Better format for query plan tree string > > > Key: SPARK-12094 > URL: https://issues.apache.org/jira/browse/SPARK-12094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.7.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > When examine plans of complex queries with multiple joins, a pain point of > mine is that, it's hard to immediately see the sibling node of a specific > query plan node. For example: > {noformat} > TakeOrderedAndProject ... > ConvertToSafe > Project ... >TungstenAggregate ... > TungstenExchange ... > TungstenAggregate ... > Project ... >BroadcastHashJoin ... > Project ... > BroadcastHashJoin ... 
> Project ... >BroadcastHashJoin ... > Scan ... > Filter ... > Scan ... > Scan ... > Project ... > Filter ... > Scan ... > {noformat} > Would be better to have tree lines to indicate relationships between plan > nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
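The "tree lines" SPARK-12094 asks for can be sketched with a small recursive renderer. The function, node names, and exact connector characters below are assumptions for illustration (Unix `tree` style), not Spark's final treeString format:

```python
def render(node, out=None, prefix="", child_prefix=""):
    """Draw a plan tree with connector lines so each node's parent and
    siblings are visually obvious. A node is a (name, children) pair."""
    if out is None:
        out = []
    name, children = node
    out.append(prefix + name)
    for i, child in enumerate(children):
        last = (i == len(children) - 1)
        connector = "+- " if last else ":- "   # branch marker for this child
        extension = "   " if last else ":  "   # continuation under this child
        render(child, out, child_prefix + connector, child_prefix + extension)
    return out

# A tiny plan shaped like the join fragments in the example above.
plan = ("Project", [("BroadcastHashJoin", [("Scan", []),
                                           ("Filter", [("Scan", [])])])])
lines = render(plan)
```

Rendered, the two inputs of the join line up under their own connectors, which is exactly the sibling visibility the ticket is after.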
[jira] [Updated] (SPARK-12080) Kryo - Support multiple user registrators
[ https://issues.apache.org/jira/browse/SPARK-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12080: -- Affects Version/s: (was: 1.6.1) 1.5.2 > Kryo - Support multiple user registrators > - > > Key: SPARK-12080 > URL: https://issues.apache.org/jira/browse/SPARK-12080 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Rotem >Priority: Minor > Labels: kryo, registrator, serializers > Original Estimate: 72h > Remaining Estimate: 72h > > Background: Currently when users need to have a custom serializer for their > registered classes, they use the user registrator of Kryo using the > spark.kryo.registrator configuration parameter. > Problem: If the Spark user is an infrastructure itself, it may receive > multiple such registrators but won't be able to register them. > Important note: Currently the single registrator supported can't reach any > state/configuration (it is instantiated by reflection with empty constructor) > Using SparkEnv from user code isn't acceptable. > Workaround: > Create a wrapper registrator as a user, and have its implementation scan the > class path for multiple classes. > Caveat: this is inefficient and too complicated. > Suggested solution - support multiple registrators + stay backward compatible > Option 1: > enhance the value of spark.kryo.registrator to support a comma separated > list for class names. This will be backward compatible and won't add new > parameters. > Option 2: > to be more logical, add spark.kryo.registrators new parameter, while keeping > the code handling the old one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
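Option 1 from SPARK-12080 (a comma-separated list in spark.kryo.registrator) can be sketched as follows. The registrator classes and the name-to-class lookup below are hypothetical; the lookup stands in for the reflective, empty-constructor instantiation Spark actually performs:

```python
registry = []  # stands in for the Kryo instance receiving registrations

class PointRegistrator:          # hypothetical user registrator
    def register_classes(self, kryo):
        kryo.append("Point")

class MatrixRegistrator:         # hypothetical user registrator
    def register_classes(self, kryo):
        kryo.append("Matrix")

def apply_registrators(conf_value, kryo):
    """Treat the config value as a comma-separated list of class names,
    instantiate each (reflectively in Spark; a dict lookup here), and
    invoke it. A single name still works, so the change stays backward
    compatible with the existing spark.kryo.registrator behavior."""
    classes = {"PointRegistrator": PointRegistrator,
               "MatrixRegistrator": MatrixRegistrator}
    for name in filter(None, (n.strip() for n in conf_value.split(","))):
        classes[name]().register_classes(kryo)  # empty-constructor instantiation

apply_registrators("PointRegistrator, MatrixRegistrator", registry)
```

An infrastructure layer can then accept several registrators and join their class names into one config value instead of scanning the classpath.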
[jira] [Commented] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036613#comment-15036613 ] Marcelo Vanzin commented on SPARK-10911: Ok, that makes sense. But since the YARN bug is the source of the problem here, the commit message (and maybe even the code) should mention it. > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12104) collect() does not handle multiple columns with same name
Hossein Falaki created SPARK-12104: -- Summary: collect() does not handle multiple columns with same name Key: SPARK-12104 URL: https://issues.apache.org/jira/browse/SPARK-12104 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.6.0 Reporter: Hossein Falaki Priority: Critical This is a regression from Spark 1.5. Spark can produce DataFrames with identical column names (e.g., after left outer joins). In 1.5, when such a DataFrame was collected, we ended up with an R data.frame with modified column names: {code} > names(mySparkDF) [1] "date" "name" "name" > names(collect(mySparkDF)) [1] "date" "name" "name.1" {code} But in 1.6 only the first column is included in the collected R data.frame. I think SparkR should continue the old behavior.
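The 1.5 behavior shown in the {code} block (a repeated name gets a numeric suffix, as R's make.unique does) can be sketched as a small renaming pass. This illustrates the expected output only; it is not SparkR's implementation:

```python
def dedup_names(names):
    """Append .1, .2, ... to repeated column names instead of dropping
    the duplicate columns, mimicking R's make.unique."""
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append("%s.%d" % (n, seen[n]))
        else:
            seen[n] = 0
            out.append(n)
    return out
```

Applied to the column list from the report, ["date", "name", "name"] becomes ["date", "name", "name.1"], keeping all columns in the collected data.frame.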
[jira] [Commented] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry
[ https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036656#comment-15036656 ] Apache Spark commented on SPARK-12106: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/10110 > Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the > timestamp of last entry > > > Key: SPARK-12106 > URL: https://issues.apache.org/jira/browse/SPARK-12106 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Burak Yavuz > > This test is still transiently flaky, because async methods can finish out of > order, and mess with the naming in the queue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry
[ https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12106: Assignee: (was: Apache Spark) > Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the > timestamp of last entry > > > Key: SPARK-12106 > URL: https://issues.apache.org/jira/browse/SPARK-12106 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Burak Yavuz > > This test is still transiently flaky, because async methods can finish out of > order, and mess with the naming in the queue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12103: -- Affects Version/s: (was: 1.4.0) (was: 1.3.0) (was: 1.2.0) (was: 1.1.0) (was: 1.0.0) Target Version/s: (was: 1.4.2, 1.6.1) Priority: Minor (was: Major) Component/s: Documentation > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receivers to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data produced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left key is always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned : DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > ..
> } > } > END CODE
[jira] [Assigned] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry
[ https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12106: Assignee: Apache Spark > Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the > timestamp of last entry > > > Key: SPARK-12106 > URL: https://issues.apache.org/jira/browse/SPARK-12106 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Burak Yavuz >Assignee: Apache Spark > > This test is still transiently flaky, because async methods can finish out of > order, and mess with the naming in the queue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12107) Update spark-ec2 versions
[ https://issues.apache.org/jira/browse/SPARK-12107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12107: - Target Version/s: 1.6.0 > Update spark-ec2 versions > - > > Key: SPARK-12107 > URL: https://issues.apache.org/jira/browse/SPARK-12107 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.0 >Reporter: Nicholas Chammas >Priority: Minor > > spark-ec2's version strings are out-of-date. The latest versions of Spark > need to be reflected in its internal version maps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator
[ https://issues.apache.org/jira/browse/SPARK-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036685#comment-15036685 ] Min Qiu commented on SPARK-12085: - looks like the BooleanSimplification rule in Spark 1.5 provides a general way to rewrite the predicate expression. It should cover my cases. Will test the query on Spark 1.5. > The join condition hidden in DNF can't be pushed down to join operator > --- > > Key: SPARK-12085 > URL: https://issues.apache.org/jira/browse/SPARK-12085 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Min Qiu > > TPC-H Q19: > {quote} > SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join > lineitem > WHERE ({color:red}p_partkey = l_partkey {color} >AND p_brand = 'Brand#12' >AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') >AND l_quantity >= 1 AND l_quantity <= 1 + 10 >AND p_size BETWEEN 1 AND 5 >AND l_shipmode IN ('AIR', 'AIR REG') >AND l_shipinstruct = 'DELIVER IN PERSON') >OR ({color:red}p_partkey = l_partkey{color} >AND p_brand = 'Brand#23' >AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') >AND l_quantity >= 10 AND l_quantity <= 10 + 10 >AND p_size BETWEEN 1 AND 10 >AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN > PERSON') >OR ({color:red}p_partkey = l_partkey{color} >AND p_brand = 'Brand#34' >AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') >AND l_quantity >= 20 AND l_quantity <= 20 + 10 >AND p_size BETWEEN 1 AND 15 >AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN > PERSON') > {quote} > The equality condition {color:red} p_partkey = l_partkey{color} matches the > join relations but it cannot be recognized by the optimizer because it's hidden in > a disjunctive normal form. As a result, the entire WHERE clause will be in a > filter operator on top of the join operator where the join condition would be > "None" in the optimized plan. 
Finally the query planner will apply a > prohibitively expensive Cartesian product on the physical plan, which causes an OOM > exception or very bad performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
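The BooleanSimplification rewrite Min Qiu mentions boils down to factoring out the conjunct shared by every DNF branch, i.e. `(a AND b) OR (a AND c)` becomes `a AND (b OR c)`, which exposes the hidden equi-join condition. A toy sketch of that factoring, with string conjuncts standing in for Catalyst expressions (all names here are illustrative, not Spark's actual implementation):

```python
# Toy sketch (not Catalyst's Expression tree): pull out the conjuncts that
# appear in every branch of a disjunctive normal form, so that
#   (a AND b) OR (a AND c)  ==>  a AND (b OR c)
# and the common equi-join condition "a" becomes visible to the planner.

def factor_common(dnf):
    """dnf is a list of branches; each branch is a set of conjunct strings."""
    common = set.intersection(*dnf)
    if not common:
        return None, dnf
    rest = [branch - common for branch in dnf]
    return common, rest

# Shape of TPC-H Q19's WHERE clause, abbreviated:
q19_like = [
    {"p_partkey = l_partkey", "p_brand = 'Brand#12'"},
    {"p_partkey = l_partkey", "p_brand = 'Brand#23'"},
    {"p_partkey = l_partkey", "p_brand = 'Brand#34'"},
]
common, rest = factor_common(q19_like)
# `common` now holds the join condition that can be pushed into the join
```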
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036710#comment-15036710 ] Xiangrui Meng commented on SPARK-8517: -- * I'm not sure whether the mathematical formulation is helpful or not. They might be useful to explain the parameters but it seems unnecessary for us to explain how the models work. I'm okay with copying the content. * We don't have linear SVM in spark.ml. * +1 on moving perceptron classifier to classification. > Improve the organization and style of MLlib's user guide > > > Key: SPARK-8517 > URL: https://issues.apache.org/jira/browse/SPARK-8517 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Timothy Hunter > > The current MLlib's user guide (and spark.ml's), especially the main page, > doesn't have a nice style. We could update it and re-organize the content to > make it easier to navigate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12082) NettyBlockTransferSecuritySuite "security mismatch auth off on client" test is flaky
[ https://issues.apache.org/jira/browse/SPARK-12082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036738#comment-15036738 ] Marcelo Vanzin commented on SPARK-12082: If the build machines run on VMs, run multiple jobs, or have anything else that may lead to sporadic stalls / slowness, I wouldn't be surprised when timeouts turn out to be unreliable. Our internal test machines are way more loaded than those used by Spark's jenkins, and we run into a bunch of different time outs that the Spark builds never see. > NettyBlockTransferSecuritySuite "security mismatch auth off on client" test > is flaky > > > Key: SPARK-12082 > URL: https://issues.apache.org/jira/browse/SPARK-12082 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Josh Rosen > Labels: flaky-test > > The NettyBlockTransferSecuritySuite "security mismatch auth off on client" > test is flaky in Jenkins. Here's a link to a report that lists the latest > failures over the past week+: > https://spark-tests.appspot.com/tests/org.apache.spark.network.netty.NettyBlockTransferSecuritySuite/security%20mismatch%20auth%20off%20on%20client#latest-failures > In all of these failures, the test failed with the following exception: > {code} > Futures timed out after [1000 milliseconds] > java.util.concurrent.TimeoutException: Futures timed out after [1000 > milliseconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) > at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86) > at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.ready(package.scala:86) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.fetchBlock(NettyBlockTransferSecuritySuite.scala:151) > at > 
org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.org$apache$spark$network$netty$NettyBlockTransferSecuritySuite$$testConnection(NettyBlockTransferSecuritySuite.scala:116) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply$mcV$sp(NettyBlockTransferSecuritySuite.scala:90) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at
[jira] [Commented] (SPARK-12082) NettyBlockTransferSecuritySuite "security mismatch auth off on client" test is flaky
[ https://issues.apache.org/jira/browse/SPARK-12082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036743#comment-15036743 ] Josh Rosen commented on SPARK-12082: For now, I'm going to try bumping the timeout to something higher, like 10 seconds, and keep an eye on the test dashboard to see if that's a sufficient fix. > NettyBlockTransferSecuritySuite "security mismatch auth off on client" test > is flaky > > > Key: SPARK-12082 > URL: https://issues.apache.org/jira/browse/SPARK-12082 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Josh Rosen > Labels: flaky-test > > The NettyBlockTransferSecuritySuite "security mismatch auth off on client" > test is flaky in Jenkins. Here's a link to a report that lists the latest > failures over the past week+: > https://spark-tests.appspot.com/tests/org.apache.spark.network.netty.NettyBlockTransferSecuritySuite/security%20mismatch%20auth%20off%20on%20client#latest-failures > In all of these failures, the test failed with the following exception: > {code} > Futures timed out after [1000 milliseconds] > java.util.concurrent.TimeoutException: Futures timed out after [1000 > milliseconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) > at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86) > at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.ready(package.scala:86) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.fetchBlock(NettyBlockTransferSecuritySuite.scala:151) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite.org$apache$spark$network$netty$NettyBlockTransferSecuritySuite$$testConnection(NettyBlockTransferSecuritySuite.scala:116) > at > 
org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply$mcV$sp(NettyBlockTransferSecuritySuite.scala:90) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84) > at > org.apache.spark.network.netty.NettyBlockTransferSecuritySuite$$anonfun$5.apply(NettyBlockTransferSecuritySuite.scala:84) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at 
org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at org.scalatest.FunSuite.run(FunSuite.scala:1555) > at
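Josh's fix in the comment above is to raise the hard-coded 1000 ms wait. A hedged sketch of making such a test timeout overridable on loaded CI hosts (the environment-variable name is hypothetical, not an actual Spark setting):

```python
import os

# Hypothetical sketch: let slow or heavily loaded CI machines raise a test
# timeout through an environment variable instead of hard-coding 1000 ms
# in the test itself.
DEFAULT_TIMEOUT_MS = 10_000  # the proposed 10-second default

def fetch_timeout_ms():
    return int(os.environ.get("TEST_FETCH_TIMEOUT_MS", DEFAULT_TIMEOUT_MS))
```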
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036807#comment-15036807 ] Davies Liu commented on SPARK-12089: This query will not generate huge record, each record should be less than 100B. > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Critical > > When running a large spark sql query including multiple joins I see tasks > failing with the following trace: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > From the Spark code it looks like this is due to an integer overflow when > growing a buffer length. The offending line {{BufferHolder.java:36}} is the > following in the version I'm running: > {code} > final byte[] tmp = new byte[length * 2]; > {code} > This seems to indicate to me that this buffer will never be able to hold more > than 2G worth of data. And it will likely hold even less, since any length > > 1073741824 will cause an integer overflow and turn the new buffer size > negative. > I hope I'm simply missing some critical config setting but it still seems > weird that we have a (rather low) upper limit on these buffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
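The arithmetic Erik points out can be made overflow-safe by widening or clamping before allocating. A minimal Python sketch of the failure mode and a safer growth rule (illustrative only, not Spark's actual patch; `JVM_MAX_ARRAY` stands in for the JVM's usual `Integer.MAX_VALUE - 8` array-size ceiling):

```python
INT_MAX = 2**31 - 1
JVM_MAX_ARRAY = INT_MAX - 8  # typical per-array size limit on the JVM

def java_int_double(length):
    """Simulate Java's 32-bit `length * 2`, which wraps negative."""
    doubled = (length * 2) & 0xFFFFFFFF
    return doubled - 2**32 if doubled >= 2**31 else doubled

def safe_grown_size(length, needed):
    """Grow by doubling, but clamp to the array ceiling instead of overflowing."""
    if needed > JVM_MAX_ARRAY:
        raise ValueError(f"cannot allocate {needed} bytes in a single array")
    return min(max(length * 2, needed), JVM_MAX_ARRAY)

# What Java's `new byte[length * 2]` computes once length exceeds 2**30:
overflowed = java_int_double(2**30)  # -2147483648, the negative size behind the exception
```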
[jira] [Closed] (SPARK-11992) Severl numbers in my spark shell (pyspark)
[ https://issues.apache.org/jira/browse/SPARK-11992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto Bonsanto closed SPARK-11992. > Severl numbers in my spark shell (pyspark) > -- > > Key: SPARK-11992 > URL: https://issues.apache.org/jira/browse/SPARK-11992 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.2 > Environment: Linux Ubuntu 14.04 LTS > Jupyter > Spark 1.5.2 >Reporter: Alberto Bonsanto >Priority: Critical > Labels: newbie > > The problem is very weird, I am currently trying to fit some classifiers from > mllib library (SVM, LogisticRegression, RandomForest, DecisionTree and > NaiveBayes), so they might classify the data properly, I am trying to compare > their performances evaluating their predictions using my current validation > data (the typical pipeline), and the problem is that when I try to fit any of > those, my spark-shell console prints millions and millions of entries, and > after that the fitting process gets stopped, you can see it > [here|http://i.imgur.com/mohLnwr.png] > Some details: > - My data has around 15M of entries. > - I use LabeledPoints to represent each entry, where the features are > SparseVectors and they have *104* features or dimensions. > - I don't show many things in the console, > [log4j.properties|https://gist.github.com/Bonsanto/c487624db805f56882b8] > - The program is running locally in a computer with 16GB of RAM. > I have already asked this, in StackOverflow, you can see it here [Crazy > print|http://stackoverflow.com/questions/33807347/pyspark-shell-outputs-several-numbers-instead-of-the-loading-arrow] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow updated SPARK-12103: --- Description: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to use one (or a few) Kafka Streaming Receiver to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data procuced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. The left-key always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( i => KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER)) val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE was: (Note: yes, there is a Direct API that may be better, but it's not the easiest thing to get started with. The Kafka Receiver API still needs to work, especially for newcomers) When creating a receiver stream using KafkaUtils, there is a valid use case where you would want to pool resources. I have 10+ topics and don't want to dedicate 10 cores to processing all of them. However, when reading the data procuced by KafkaUtils.createStream, the DStream[(String,String)] does not properly insert the topic name into the tuple. 
The left-key always null, making it impossible to know what topic that data came from other than stashing your key into the value. Is there a way around that problem? CODE val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, "topicE" -> 1, "topicF" -> 1, ...) val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( i => KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, consumerProperties, topics, StorageLevel.MEMORY_ONLY_SER)) val unioned :DStream[(String,String)] = ssc.union(streams) unioned.flatMap(x => { val (key, value) = x // key is always null! // value has data from any one of my topics key match ... { .. } } END CODE > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1 >Reporter: Dan Dutrow > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036690#comment-15036690 ] Xiangrui Meng commented on SPARK-8517: -- * Agree that the focus of spark.ml should not only be pipeline but also DataFrames. * +1 on reorganizing the menu spark.ml menu based on the goal. * Users should be able to use individual algorithms under spark.ml. There are some missing features, but this is not part of this JIRA. > Improve the organization and style of MLlib's user guide > > > Key: SPARK-8517 > URL: https://issues.apache.org/jira/browse/SPARK-8517 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Timothy Hunter > > The current MLlib's user guide (and spark.ml's), especially the main page, > doesn't have a nice style. We could update it and re-organize the content to > make it easier to navigate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036741#comment-15036741 ] Dan Dutrow commented on SPARK-12103: One possible way around this problem would be to stick the topic name into the message key. Unless the message key is supposed to be unique, then this would allow you to retrieve the topic name from the data. > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0, 1.4.1 >Reporter: Dan Dutrow > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! 
> // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
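Dan's proposed workaround (having the producer write the topic name as the message key) can be sketched on the consumer side. Plain tuples stand in for the `DStream[(String, String)]` records so the sketch stays self-contained:

```python
# Hedged sketch of the workaround: if the producer stashes the topic name
# in the message key, the receiver-based stream's (key, value) tuples can
# be routed by topic even though createStream itself leaves the key null.
from collections import defaultdict

records = [
    ("topicA", "payload-1"),   # key = topic name, written by the producer
    ("topicB", "payload-2"),
    ("topicA", "payload-3"),
]

by_topic = defaultdict(list)
for topic, value in records:
    by_topic[topic].append(value)
```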
[jira] [Commented] (SPARK-12104) collect() does not handle multiple columns with same name
[ https://issues.apache.org/jira/browse/SPARK-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036791#comment-15036791 ] Shivaram Venkataraman commented on SPARK-12104: --- Any ideas what caused this? > collect() does not handle multiple columns with same name > - > > Key: SPARK-12104 > URL: https://issues.apache.org/jira/browse/SPARK-12104 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Hossein Falaki >Priority: Critical > > This is a regression from Spark 1.5. > Spark can produce DataFrames with identical column names (e.g., after left outer > joins). In 1.5, when such a DataFrame was collected, we ended up with an R > data.frame with modified column names: > {code} > > names(mySparkDF) > [1] "date" "name" "name" > > names(collect(mySparkDF)) > [1] "date" "name" "name.1" > {code} > But in 1.6 only the first column is included in the collected R data.frame. I > think SparkR should continue the old behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
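The Spark 1.5 behavior quoted above mirrors R's `make.unique`: the first occurrence of a name is kept and later duplicates get `.1`, `.2`, ... suffixes. A small sketch of that renaming rule (not SparkR's actual code):

```python
def make_unique(names):
    """R-style make.unique: first occurrence unchanged, repeats get .1, .2, ..."""
    seen = {}
    out = []
    for n in names:
        k = seen.get(n, 0)
        seen[n] = k + 1
        out.append(n if k == 0 else f"{n}.{k}")
    return out

# Matches the Spark 1.5 collect() result quoted above:
# make_unique(["date", "name", "name"]) -> ["date", "name", "name.1"]
```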
[jira] [Created] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
Andrew Davidson created SPARK-12110: --- Summary: spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive Key: SPARK-12110 URL: https://issues.apache.org/jira/browse/SPARK-12110 Project: Spark Issue Type: Bug Components: ML, PySpark, SQL Affects Versions: 1.5.1 Environment: cluster created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 Reporter: Andrew Davidson I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I can not run the tokenizer sample code. Is there a work around? Kind regards Andy /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) 658 raise Exception("You must build Spark with Hive. " 659 "Export 'SPARK_HIVE=true' and run " --> 660 "build/sbt assembly", e) 661 662 def _get_hive_ctx(self): Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError('An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38)) http://spark.apache.org/docs/latest/ml-features.html#tokenizer from pyspark.ml.feature import Tokenizer, RegexTokenizer sentenceDataFrame = sqlContext.createDataFrame([ (0, "Hi I heard about Spark"), (1, "I wish Java could use case classes"), (2, "Logistic,regression,models,are,neat") ], ["label", "sentence"]) tokenizer = Tokenizer(inputCol="sentence", outputCol="words") wordsDataFrame = tokenizer.transform(sentenceDataFrame) for words_label in wordsDataFrame.select("words", "label").take(3): print(words_label) --- Py4JJavaError Traceback (most recent call last) /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) 654 if not hasattr(self, '_scala_HiveContext'): --> 655 self._scala_HiveContext = self._get_hive_ctx() 656 return self._scala_HiveContext /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) 662 def _get_hive_ctx(self): --> 663 return self._jvm.HiveContext(self._jsc.sc()) 664 
/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 700 return_value = get_return_value(answer, self._gateway_client, None, --> 701 self._fqn) 702 /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 35 try: ---> 36 return f(*a, **kw) 37 except py4j.protocol.Py4JJavaError as e: /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.io.IOException: Filesystem closed at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:214) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323) at 
org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554) at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599) at
[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036809#comment-15036809 ]

shane knapp commented on SPARK-11255:
-

ok, [~shivaram] and i got together and whipped up a test build on our staging node w/R 3.1.1, and it looks like we're good!

https://amplab.cs.berkeley.edu/jenkins/view/AMPLab%20Infra/job/SparkR-testing/2/

[jenkins@amp-jenkins-staging-01 ~]# R --version | head -1
R version 3.1.1 (2014-07-10) -- "Sock it to Me"

i'll need to schedule some jenkins-wide downtime (before EOY) to remove all traces of the old R, install 3.1.1, and recompile any additional packages.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Felix Cheung
> Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected, since the Jenkins test build
> is running something newer.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
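The manual `R --version | head -1` check above is easy to automate. A minimal Python sketch that parses the first line of `R --version` output and compares it against the required 3.1.1 (the banner string is a hard-coded sample of the output shown above, not live command output):

```python
import re

REQUIRED = "3.1.1"

def r_version(banner):
    """Extract 'x.y.z' from the first line of `R --version` output."""
    m = re.match(r"R version (\d+\.\d+\.\d+)", banner)
    if m is None:
        raise ValueError("unrecognized banner: %r" % banner)
    return m.group(1)

# Sample first line, copied from the Jenkins staging node output above.
banner = 'R version 3.1.1 (2014-07-10) -- "Sock it to Me"'
print(r_version(banner) == REQUIRED)  # True
```

A CI job could run this against the real command output and fail the build on a mismatch, which is exactly the kind of drift this ticket is about.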
[jira] [Created] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
Dean Wampler created SPARK-12105:

Summary: Add a DataFrame.show() with argument for output PrintStream
Key: SPARK-12105
URL: https://issues.apache.org/jira/browse/SPARK-12105
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.5.2
Reporter: Dean Wampler
Priority: Minor

It would be nice to send the output of DataFrame.show(...) to a different output stream than stdout, including just capturing the string itself. This is useful, e.g., for testing. Actually, it would be sufficient and perhaps better to just make DataFrame.showString a public method,
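The request maps naturally onto an output-stream parameter that defaults to stdout. A hypothetical Python sketch of the same idea — the `show` function, its signature, and the table rendering are illustrative stand-ins, not Spark's actual API:

```python
import io
import sys

def show(rows, out=sys.stdout):
    """Render rows to any writable stream; defaults to stdout.

    Passing an in-memory buffer instead of a real stream is what makes
    the output capturable for tests, as the ticket suggests.
    """
    for row in rows:
        out.write("| " + " | ".join(str(c) for c in row) + " |\n")

# Normal use: prints to stdout.
show([(1, "a"), (2, "b")])

# Test use: capture the rendered table as a string.
buf = io.StringIO()
show([(1, "a"), (2, "b")], out=buf)
captured = buf.getvalue()
print(captured.splitlines()[0])  # | 1 | a |
```

Exposing the string-building step directly (the `showString` route mentioned above) achieves the same goal without any stream plumbing; the stream parameter is simply the more general of the two designs.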
[jira] [Commented] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join
[ https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036683#comment-15036683 ]

Michael Armbrust commented on SPARK-12066:
--

Can you reproduce this on 1.6-rc1?

> spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.*
> with join
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0, 1.5.2
> Environment: linux
> Reporter: Ricky Yang
> Priority: Blocker
>
> A java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following
> Spark SQL on Spark standalone or YARN.
> The SQL:
> select ta.*
> from bi_td.dm_price_seg_td tb
> join bi_sor.sor_ord_detail_tf ta
> on 1 = 1
> where ta.sale_dt = '20140514'
> and ta.sale_price >= tb.pri_from
> and ta.sale_price < tb.pri_to limit 10 ;
> But the result is correct when no * is used, as follows:
> select ta.sale_dt
> from bi_td.dm_price_seg_td tb
> join bi_sor.sor_ord_detail_tf ta
> on 1 = 1
> where ta.sale_dt = '20140514'
> and ta.sale_price >= tb.pri_from
> and ta.sale_price < tb.pri_to limit 10 ;
> The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
> Error log:
>
> 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.*
> from bi_td.dm_price_seg_td tb
> join bi_sor.sor_ord_detail_tf ta
> on 1 = 1
> where ta.sale_dt = '20140514'
> and ta.sale_price >= tb.pri_from
> and ta.sale_price < tb.pri_to limit 10 ]
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
> (TID 3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
> at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
> at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409)
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
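The failing query can be reduced to the shape below. This is a sqlite3 stand-in with hypothetical schemas for the two tables, so the Spark bug itself (a failure during `ta.*` star expansion) does not reproduce here; the sketch only shows the query structure and the invariant the report relies on, namely that the `ta.*` and `ta.sale_dt` projections must return the same rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative schemas; the real bi_td/bi_sor tables are not described
# in the report beyond the columns the query touches.
con.executescript("""
    CREATE TABLE dm_price_seg_td (pri_from REAL, pri_to REAL);
    CREATE TABLE sor_ord_detail_tf (sale_dt TEXT, sale_price REAL);
    INSERT INTO dm_price_seg_td VALUES (0, 100), (100, 200);
    INSERT INTO sor_ord_detail_tf VALUES ('20140514', 50), ('20140514', 150);
""")

QUERY = """
    SELECT {projection}
    FROM dm_price_seg_td tb
    JOIN sor_ord_detail_tf ta ON 1 = 1
    WHERE ta.sale_dt = '20140514'
      AND ta.sale_price >= tb.pri_from
      AND ta.sale_price < tb.pri_to
    LIMIT 10
"""

star = con.execute(QUERY.format(projection="ta.*")).fetchall()
one_col = con.execute(QUERY.format(projection="ta.sale_dt")).fetchall()
# Swapping the projection must not change which rows match.
print(len(star) == len(one_col))  # True
```

In the report, only the `ta.*` variant fails, which is what points at the star-expansion path in Spark's analyzer rather than at the join or filter logic.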
[jira] [Updated] (SPARK-12106) Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry
[ https://issues.apache.org/jira/browse/SPARK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-12106:
---
Labels: flaky-test (was: )

> Flaky Test: BatchedWriteAheadLog - name log with aggregated entries with the
> timestamp of last entry
>
> Key: SPARK-12106
> URL: https://issues.apache.org/jira/browse/SPARK-12106
> Project: Spark
> Issue Type: Test
> Components: Streaming
> Reporter: Burak Yavuz
> Labels: flaky-test
>
> This test is still transiently flaky, because async methods can finish out of
> order and mess with the naming in the queue.
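The race is easy to state concretely: if "last entry" means the entry that happened to arrive last, then scheduling order leaks into the batch name. A minimal Python sketch of the flakiness (assuming, per the test name, that a batch of aggregated entries is named with the timestamp of its last entry; the data layout is illustrative):

```python
# Each record is (timestamp, payload). Async writers may complete and
# enqueue out of timestamp order, so arrival order is non-deterministic.
batch = [(1, "a"), (3, "b"), (2, "c")]  # arrival order != timestamp order

name_by_arrival = batch[-1][0]          # 2 -- depends on scheduling, flaky
name_by_max = max(t for t, _ in batch)  # 3 -- deterministic

print(name_by_arrival, name_by_max)  # 2 3
```

Naming by the maximum timestamp (or sorting the aggregated entries before naming) removes the dependence on completion order, which is the property a deterministic test needs.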
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036718#comment-15036718 ]

Xiangrui Meng commented on SPARK-8517:
--

[~timhunter] I agree with most of your points. I'd recommend the following steps:

1. Reorganize the content of the spark.ml guide based on the goals, but do not introduce new content if possible.
2. Create JIRAs for the rest of the tasks, which could be done in parallel, including:
* spark.ml branding
* move model selection / cross validation to tuning
* enhance guides for individual algorithms, etc.

> Improve the organization and style of MLlib's user guide
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Timothy Hunter
>
> The current MLlib user guide (and spark.ml's), especially the main page,
> doesn't have a nice style. We could update it and re-organize the content to
> make it easier to navigate.