[jira] [Commented] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546846#comment-14546846 ]

Sean Owen commented on SPARK-7670:
--
I can't reproduce this on Ubuntu 14 at master either. I think it's a problem with your environment? Others are using Scala 2.11, presumably without problems. Are there earlier errors?

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.4.0
Reporter: Fernando Ruben Otero
Attachments: Dockerfile

When trying to build Spark with Scala 2.11 on revision c64ff8036cc6bc7c87743f4c751d7fe91c2e366a (the one on master when I'm submitting this issue) I'm getting:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -Dhadoop.version=2.6.0 -DskipTests clean install
...
...
...
[INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ spark-network-shuffle_2.11 ---
/Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type
protected Type type() { return Type.UPLOAD_BLOCK; }
^
/Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37: error: not found: type Type
protected Type type() { return Type.STREAM_HANDLE; }
^
/Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: error: not found: type Type
protected Type type() { return Type.REGISTER_EXECUTOR; }
^
/Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: error: not found: type Type
protected Type type() { return Type.OPEN_BLOCKS; }

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546793#comment-14546793 ]

Fernando Ruben Otero edited comment on SPARK-7670 at 5/16/15 4:11 PM:
--
This Dockerfile reproduces the error I see on my machine. I reproduced the same behavior on OS X and Ubuntu too.

was (Author: zeos): This docker file reproduces the error on my machine

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546793#comment-14546793 ]

Fernando Ruben Otero edited comment on SPARK-7670 at 5/16/15 4:13 PM:
--
This Dockerfile reproduces the error I see on my machine on a clean Fedora. Even though the OS should not be the issue, I reproduced the same behavior on OS X and Ubuntu too, since I usually work with those environments.

was (Author: zeos): This docker file reproduces the error I see on my machine I reproduced the same behavior on a OSX and Ubuntu too

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546875#comment-14546875 ]

Sean Owen commented on SPARK-7670:
--
Yeah, I see the same thing with your Dockerfile. The strange thing is, the problem shows up in compiling the Java code. I'm still tempted to say this must be something funny about this environment, since I have two other environments where it's fine, but I don't know whether it's down to the Java version or Scala or what. I have Java 8 / Scala 2.11.6 in both cases and it works. The thing is that the Java code is valid and correct, so at best this is some compiler problem? So I'm not sure what to do if it's not a Spark issue and doesn't affect the build environments that developers will use to produce artifacts.

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7527) Wrong detection of REPL mode in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546892#comment-14546892 ]

Oleksii Kostyliev commented on SPARK-7527:
--
In the end, due to the additional complexity, it was decided to fix this separately from SPARK-7233.

Wrong detection of REPL mode in ClosureCleaner
--
Key: SPARK-7527
URL: https://issues.apache.org/jira/browse/SPARK-7527
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Oleksii Kostyliev
Priority: Minor

If the REPL class is not present on the classpath, the {{inInterpreter}} boolean switch should be {{false}}, not {{true}}, at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L247

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
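For context, a minimal hedged sketch of the check being discussed; the class name and structure are illustrative assumptions, not the actual ClosureCleaner code. The point is only that a missing REPL class should yield false:

{code}
// Hedged sketch: detect REPL mode by probing for the interpreter's main class.
// If the class is absent from the classpath we are certainly not in the REPL,
// so the flag should fall back to false rather than true.
val inInterpreter: Boolean =
  try {
    Class.forName(
      "org.apache.spark.repl.Main",   // assumed REPL entry point, for illustration
      false,                          // don't run static initializers
      Thread.currentThread().getContextClassLoader)
    true
  } catch {
    case _: ClassNotFoundException => false  // REPL not on classpath => not REPL mode
  }
{code}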
[jira] [Updated] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fernando Ruben Otero updated SPARK-7670: Attachment: Dockerfile This docker file reproduces the error on my machine Failure when building with scala 2.11 (after 1.3.1 -- Key: SPARK-7670 URL: https://issues.apache.org/jira/browse/SPARK-7670 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Fernando Ruben Otero Attachments: Dockerfile When trying to build spark with scala 2.11 on revision c64ff8036cc6bc7c87743f4c751d7fe91c2e366a (the one on master when I'm submitting this issue) I'm getting export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m dev/change-version-to-2.11.sh mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -Dhadoop.version=2.6.0 -DskipTests clean install ... ... ... [INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ spark-network-shuffle_2.11 --- /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() { return Type.UPLOAD_BLOCK; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37: error: not found: type Type protected Type type() { return Type.STREAM_HANDLE; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: error: not found: type Type protected Type type() { return Type.REGISTER_EXECUTOR; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: error: not found: type Type protected Type type() { return Type.OPEN_BLOCKS; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546794#comment-14546794 ]

Fernando Ruben Otero commented on SPARK-7670:
--
I just attached a Dockerfile that reproduces the error on master (I just ran it). I'm doing a binary search on commits to find where the build broke; so far fc17661475443d9f0a8d28e3439feeb7a7bca67b is building.

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6439) Show per-task metrics when you hover over a task in the web UI visualization
[ https://issues.apache.org/jira/browse/SPARK-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-6439: -- Assignee: Kousuke Saruta (was: Kay Ousterhout) Show per-task metrics when you hover over a task in the web UI visualization Key: SPARK-6439 URL: https://issues.apache.org/jira/browse/SPARK-6439 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kousuke Saruta Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546847#comment-14546847 ]

Fernando Ruben Otero edited comment on SPARK-7670 at 5/16/15 4:34 PM:
--
I made the Dockerfile because that generates a clean environment. I don't see any other error before that.

was (Author: zeos): I did the docker file because that generates a clean environment

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546847#comment-14546847 ]

Fernando Ruben Otero commented on SPARK-7670:
--
I made the Dockerfile because that generates a clean environment.

Failure when building with scala 2.11 (after 1.3.1
--
Key: SPARK-7670
URL: https://issues.apache.org/jira/browse/SPARK-7670

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545677#comment-14545677 ]

Yana Kadiyska edited comment on SPARK-4412 at 5/16/15 5:11 PM:
---
I would like to reopen this, as I believe the issue has regressed again in Spark 1.3.0. This SO thread has a lengthy discussion http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark but the short summary is that the log4j.rootCategory=ERROR, console setting still leaks
{quote}
INFO: parquet.hadoop.InternalParquetRecordReader
{quote}
messages

was (Author: yanakad): I would like to reopen as I believe the issue has again regressed in Spark 1.3.0. This SO thread has a lengthy discussion http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark but the short summary is that log4j.rootCategory=ERROR, console setting still leaks INFO: parquet.hadoop.InternalParquetRecordReader messages

Parquet logger cannot be configured
---
Key: SPARK-4412
URL: https://issues.apache.org/jira/browse/SPARK-4412
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Jim Carroll

The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer. The fix would be to force the static initializer to get called in parquet.Log as part of enableLogForwarding. PR will be forthcoming.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
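As a rough illustration of the fix the reporter describes (force parquet.Log's static initializer to run before adjusting the java.util.logging handlers), a hedged sketch; this is not the actual Spark patch, and the handler tweaks are illustrative:

{code}
import java.util.logging.{Level, Logger}

object ParquetLogRedirect {
  def enableLogForwarding(): Unit = {
    // Loading the class runs parquet.Log's static initializer exactly once,
    // so it can no longer undo the configuration applied below.
    Class.forName("parquet.Log")

    val logger = Logger.getLogger("parquet")
    logger.getHandlers.foreach(logger.removeHandler)  // drop the default console handler
    logger.setUseParentHandlers(false)
    logger.setLevel(Level.WARNING)                    // silence the INFO chatter
  }
}
{code}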
[jira] [Updated] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6657: - Fix Version/s: (was: 1.3.1) Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-6197.
--
Resolution: Fixed
Fix Version/s: 1.3.2
Target Version/s: 1.3.2, 1.4.0 (was: 1.4.0)

I back-ported to 1.3.x. I infer from the previous target and label that this was the work left to do.

handle json parse exception for eventlog file not finished writing
---
Key: SPARK-6197
URL: https://issues.apache.org/jira/browse/SPARK-6197
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Assignee: Zhang, Liye
Priority: Minor
Fix For: 1.3.2, 1.4.0

This is a follow-up JIRA for [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can display event log files with the suffix *.inprogress*. However, the event log file may not be finished writing in some abnormal cases (e.g. Ctrl+C), in which case the file may be truncated in the last line, leaving that line in invalid JSON format, which will cause a JSON parse exception when reading the file. In this case, we can just ignore the content of the last line, since the history shown on the web for abnormal cases is only a reference for the user; it can demonstrate the past status of the app before it terminated abnormally (we cannot guarantee the history shows exactly the last moment when the app encountered the abnormal situation).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
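A hedged sketch of the behavior described above; this is not the actual replay code, just the shape of "tolerate a parse failure on the final line of an .inprogress log":

{code}
import scala.io.Source

// Replays an event log line by line; if the file may still have been in progress,
// a failure on the last line is ignored (the writer was likely interrupted, e.g. Ctrl+C).
def replayEventLog(path: String, maybeTruncated: Boolean)(parseAndPost: String => Unit): Unit = {
  val lines = Source.fromFile(path).getLines().toVector
  lines.zipWithIndex.foreach { case (line, idx) =>
    try {
      parseAndPost(line)  // e.g. JSON-decode the event and post it to listeners
    } catch {
      case e: Exception if maybeTruncated && idx == lines.size - 1 =>
        println(s"Ignoring possibly truncated last line of $path: ${e.getMessage}")
    }
  }
}
{code}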
[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6197:
--
Labels: (was: backport-needed)

handle json parse exception for eventlog file not finished writing
---
Key: SPARK-6197
URL: https://issues.apache.org/jira/browse/SPARK-6197

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3928: - Fix Version/s: (was: 1.3.0) Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Assignee: Cheng Lian Priority: Minor {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
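For concreteness, a hedged sketch of the requested usage (paths are made up; in the 1.x API parquetFile lives on SQLContext, and glob support is the behavior being asked for):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("GlobSketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

val logs = sc.textFile("hdfs:///logs/2014-??-??/part-*")                 // glob patterns already work here
val events = sqlContext.parquetFile("hdfs:///warehouse/events/part-*")   // the behavior this issue requests
{code}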
[jira] [Updated] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4325: - Fix Version/s: (was: 1.3.0) Improve spark-ec2 cluster launch times -- Key: SPARK-4325 URL: https://issues.apache.org/jira/browse/SPARK-4325 Project: Spark Issue Type: Umbrella Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor This is an umbrella task to capture several pieces of work related to significantly improving spark-ec2 cluster launch times. There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of {code} for node in $NODES; do command sleep 0.3 done wait{code} with simpler calls to {{pssh}}. * Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6657: - Target Version/s: 1.3.2, 1.4.0 (was: 1.3.1, 1.4.0) Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3490) Alleviate port collisions during tests
[ https://issues.apache.org/jira/browse/SPARK-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3490. -- Resolution: Fixed Target Version/s: 1.2.0, 1.1.1, 0.9.3, 1.0.3 (was: 0.9.3, 1.0.3, 1.1.1, 1.2.0) I think it's not likely there would be another 0.9 or 1.0 branch release now. Alleviate port collisions during tests -- Key: SPARK-3490 URL: https://issues.apache.org/jira/browse/SPARK-3490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 0.9.3, 1.2.0, 1.1.1 A few tests, notably SparkSubmitSuite and DriverSuite, have been failing intermittently because we open too many ephemeral ports and occasionally can't bind to new ones. We should minimize the use of ports during tests where possible. A great candidate is the SparkUI, which is not needed for most tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
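A minimal sketch of the mitigation described above, assuming the standard spark.ui.enabled switch (the suite name is made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Most suites don't need the web UI; skipping it avoids binding one more
// ephemeral port for every SparkContext created during the test run.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("PortFriendlySuite")
  .set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)
{code}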
[jira] [Resolved] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3987. -- Resolution: Fixed From the discussion it sounds like the issue that this JIRA concerns was actually OK. NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.2.0, 1.1.1 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 106033.647837, 33256.395091, 
339200.268106, -375442.716811, -41027.594509, 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 148311.355507, 42231.381747, -12524.380088, 219025.288975, 203889.695169,
[jira] [Updated] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4258: - Fix Version/s: (was: 1.2.0) NPE with new Parquet Filters Key: SPARK-4258 URL: https://issues.apache.org/jira/browse/SPARK-4258 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): java.lang.NullPointerException: parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Or.accept(Operators.java:302) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$And.accept(Operators.java:290) parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) {code} This occurs when reading parquet data encoded with the older version of the library for TPC-DS query 34. Will work on coming up with a smaller reproduction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2973: - Fix Version/s: (was: 1.2.0) Use LocalRelation for all ExecutedCommands, avoid job for take/collect() Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Assignee: Cheng Lian Priority: Critical Right now, sql(show tables).collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4412: - Fix Version/s: (was: 1.2.0) Parquet logger cannot be configured --- Key: SPARK-4412 URL: https://issues.apache.org/jira/browse/SPARK-4412 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jim Carroll The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded then the code in enableLogForwarding has no affect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no affect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding or whatever enableLogForwarding does gets undone by this static initializer. The fix would be to force the static initializer to get called in parquet.Log as part of enableForwardLogging. PR will be forthcomming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-2750:
--
Fix Version/s: (was: 1.0.3)

Add Https support for Web UI

Key: SPARK-2750
URL: https://issues.apache.org/jira/browse/SPARK-2750
Project: Spark
Issue Type: New Feature
Components: Web UI
Reporter: Tao Wang
Labels: https, ssl, webui
Attachments: exception on yarn when https enabled.txt
Original Estimate: 96h
Remaining Estimate: 96h

Now I'm trying to add HTTPS support for the web UI using the Jetty SSL integration. Below is the plan:
1. The web UI includes the Master UI, Worker UI, HistoryServer UI and Spark UI. Users can switch between HTTPS and HTTP by configuring spark.http.policy as a JVM property for each process, with HTTP chosen by default.
2. The web ports of the Master and Worker would be decided in order of launch arguments, JVM property, system env and default port.
3. Below are some other configuration items:
spark.ssl.server.keystore.location - The file or URL of the SSL key store
spark.ssl.server.keystore.password - The password for the key store
spark.ssl.server.keystore.keypassword - The password (if any) for the specific key within the key store
spark.ssl.server.keystore.type - The type of the key store (default JKS)
spark.client.https.need-auth - True if SSL needs client authentication
spark.ssl.server.truststore.location - The file name or URL of the trust store location
spark.ssl.server.truststore.password - The password for the trust store
spark.ssl.server.truststore.type - The type of the trust store (default JKS)
Any feedback is welcome!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7685) Handle high imbalanced data or apply weights to different samples in Logistic Regression
DB Tsai created SPARK-7685:
--
Summary: Handle high imbalanced data or apply weights to different samples in Logistic Regression
Key: SPARK-7685
URL: https://issues.apache.org/jira/browse/SPARK-7685
Project: Spark
Issue Type: New Feature
Components: ML
Reporter: DB Tsai

In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data will bias the model toward the negative class, resulting in poor performance. In Python's scikit-learn, they provide a correction allowing users to over-/under-sample the samples of each class according to the given weights. In auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into the loss and gradient instead of doing actual over-/under-sampling in the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data may be more important, like the training samples from tenured users, while the training samples from new users may be less important. We should be able to provide another weight: Double field in the LabeledPoint to weight them differently in the learning algorithm.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7685) Handle high imbalanced data and apply weights to different samples in Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7685: --- Summary: Handle high imbalanced data and apply weights to different samples in Logistic Regression (was: Handle high imbalanced data or apply weights to different samples in Logistic Regression) Handle high imbalanced data and apply weights to different samples in Logistic Regression - Key: SPARK-7685 URL: https://issues.apache.org/jira/browse/SPARK-7685 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai In fraud detection dataset, almost all the samples are negative while only couple of them are positive. This type of high imbalanced data will bias the models toward negative resulting poor performance. In python-scikit, they provide a correction allowing users to Over-/undersample the samples of each class according to the given weights. In auto mode, selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into loss and gradient instead of doing actual over/undersampling in the training dataset which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data maybe more important like the training samples from tenure users while the training samples from new users maybe less important. We should be able to provide another weight: Double information in the LabeledPoint to weight them differently in the learning algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
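A hedged sketch of the weighting idea described above; this is not Spark's implementation, just how a per-example weight scales the logistic loss and gradient instead of physically resampling:

{code}
// label is 0.0 or 1.0; weight could be e.g. the inverse class frequency.
def weightedLogisticLossAndGradient(
    w: Array[Double],
    x: Array[Double],
    label: Double,
    weight: Double): (Double, Array[Double]) = {
  val margin = (w, x).zipped.map(_ * _).sum
  val p = 1.0 / (1.0 + math.exp(-margin))
  // Standard log-loss, scaled by the per-example weight
  val loss = -weight * (label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
  // Gradient of the weighted loss with respect to w
  val gradient = x.map(_ * weight * (p - label))
  (loss, gradient)
}
{code}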
[jira] [Resolved] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4556. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6186 [https://github.com/apache/spark/pull/6186] binary distribution assembly can't run in local mode Key: SPARK-4556 URL: https://issues.apache.org/jira/browse/SPARK-4556 Project: Spark Issue Type: Bug Components: Build, Deploy, Documentation Reporter: Sean Busbey Fix For: 1.4.0 After building the binary distribution assembly, the resultant tarball can't be used for local mode. {code} busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package [INFO] Scanning for projects... ...SNIP... [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s] [INFO] Spark Project Networking ... SUCCESS [ 31.402 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 s] [INFO] Spark Project Core . SUCCESS [15:39 min] [INFO] Spark Project Bagel SUCCESS [ 29.470 s] [INFO] Spark Project GraphX ... SUCCESS [05:20 min] [INFO] Spark Project Streaming SUCCESS [11:02 min] [INFO] Spark Project Catalyst . SUCCESS [11:26 min] [INFO] Spark Project SQL .. SUCCESS [11:33 min] [INFO] Spark Project ML Library ... SUCCESS [14:27 min] [INFO] Spark Project Tools SUCCESS [ 40.980 s] [INFO] Spark Project Hive . SUCCESS [11:45 min] [INFO] Spark Project REPL . SUCCESS [03:15 min] [INFO] Spark Project Assembly . SUCCESS [04:22 min] [INFO] Spark Project External Twitter . SUCCESS [ 43.567 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s] [INFO] Spark Project External Flume ... SUCCESS [01:41 min] [INFO] Spark Project External MQTT SUCCESS [ 40.973 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s] [INFO] Spark Project External Kafka ... SUCCESS [01:23 min] [INFO] Spark Project Examples . SUCCESS [10:19 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:47 h [INFO] Finished at: 2014-11-22T02:13:51-06:00 [INFO] Final Memory: 79M/2759M [INFO] busbey2-MBA:spark busbey$ cd assembly/target/ busbey2-MBA:target busbey$ mkdir dist-temp busbey2-MBA:target busbey$ tar -C dist-temp -xzf spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz busbey2-MBA:target busbey$ cd dist-temp/ busbey2-MBA:dist-temp busbey$ ./bin/spark-shell ls: /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: No such file or directory Failed to find Spark assembly in /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 You need to build Spark before running this program. {code} It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't handle it. If I move all of the spark-*.jar files from the top level into the lib folder and touch the RELEASE file, then the spark shell launches in local mode normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4556) Document that make-distribution.sh is required to make a runnable distribution
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4556: - Component/s: (was: Spark Shell) Documentation Deploy Priority: Minor (was: Major) Assignee: Sean Owen Issue Type: Improvement (was: Bug) Summary: Document that make-distribution.sh is required to make a runnable distribution (was: binary distribution assembly can't run in local mode) Document that make-distribution.sh is required to make a runnable distribution -- Key: SPARK-4556 URL: https://issues.apache.org/jira/browse/SPARK-4556 Project: Spark Issue Type: Improvement Components: Build, Deploy, Documentation Reporter: Sean Busbey Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 After building the binary distribution assembly, the resultant tarball can't be used for local mode. {code} busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package [INFO] Scanning for projects... ...SNIP... [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s] [INFO] Spark Project Networking ... SUCCESS [ 31.402 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 s] [INFO] Spark Project Core . SUCCESS [15:39 min] [INFO] Spark Project Bagel SUCCESS [ 29.470 s] [INFO] Spark Project GraphX ... SUCCESS [05:20 min] [INFO] Spark Project Streaming SUCCESS [11:02 min] [INFO] Spark Project Catalyst . SUCCESS [11:26 min] [INFO] Spark Project SQL .. SUCCESS [11:33 min] [INFO] Spark Project ML Library ... SUCCESS [14:27 min] [INFO] Spark Project Tools SUCCESS [ 40.980 s] [INFO] Spark Project Hive . SUCCESS [11:45 min] [INFO] Spark Project REPL . SUCCESS [03:15 min] [INFO] Spark Project Assembly . SUCCESS [04:22 min] [INFO] Spark Project External Twitter . SUCCESS [ 43.567 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s] [INFO] Spark Project External Flume ... SUCCESS [01:41 min] [INFO] Spark Project External MQTT SUCCESS [ 40.973 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s] [INFO] Spark Project External Kafka ... SUCCESS [01:23 min] [INFO] Spark Project Examples . SUCCESS [10:19 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:47 h [INFO] Finished at: 2014-11-22T02:13:51-06:00 [INFO] Final Memory: 79M/2759M [INFO] busbey2-MBA:spark busbey$ cd assembly/target/ busbey2-MBA:target busbey$ mkdir dist-temp busbey2-MBA:target busbey$ tar -C dist-temp -xzf spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz busbey2-MBA:target busbey$ cd dist-temp/ busbey2-MBA:dist-temp busbey$ ./bin/spark-shell ls: /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: No such file or directory Failed to find Spark assembly in /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 You need to build Spark before running this program. {code} It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't handle it. If I move all of the spark-*.jar files from the top level into the lib folder and touch the RELEASE file, then the spark shell launches in local mode normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7672) Number format exception with spark.kryoserializer.buffer.mb
[ https://issues.apache.org/jira/browse/SPARK-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7672: - Assignee: Nishkam Ravi Number format exception with spark.kryoserializer.buffer.mb --- Key: SPARK-7672 URL: https://issues.apache.org/jira/browse/SPARK-7672 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nishkam Ravi Assignee: Nishkam Ravi Priority: Critical Fix For: 1.4.0 With spark.kryoserializer.buffer.mb 1000 : Exception in thread main java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. Fractional values are not supported. Input was: 100.0 at org.apache.spark.network.util.JavaUtils.parseByteString(JavaUtils.java:238) at org.apache.spark.network.util.JavaUtils.byteStringAsKb(JavaUtils.java:259) at org.apache.spark.util.Utils$.byteStringAsKb(Utils.scala:1037) at org.apache.spark.SparkConf.getSizeAsKb(SparkConf.scala:245) at org.apache.spark.serializer.KryoSerializer.init(KryoSerializer.scala:53) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:269) at org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:280) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:283) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7672) Number format exception with spark.kryoserializer.buffer.mb
[ https://issues.apache.org/jira/browse/SPARK-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7672. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6198 [https://github.com/apache/spark/pull/6198] Number format exception with spark.kryoserializer.buffer.mb --- Key: SPARK-7672 URL: https://issues.apache.org/jira/browse/SPARK-7672 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nishkam Ravi Priority: Critical Fix For: 1.4.0 With spark.kryoserializer.buffer.mb 1000 : Exception in thread main java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. Fractional values are not supported. Input was: 100.0 at org.apache.spark.network.util.JavaUtils.parseByteString(JavaUtils.java:238) at org.apache.spark.network.util.JavaUtils.byteStringAsKb(JavaUtils.java:259) at org.apache.spark.util.Utils$.byteStringAsKb(Utils.scala:1037) at org.apache.spark.SparkConf.getSizeAsKb(SparkConf.scala:245) at org.apache.spark.serializer.KryoSerializer.init(KryoSerializer.scala:53) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:269) at org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:280) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:283) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
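For context, a minimal sketch (not from the ticket) of the configuration forms involved: the legacy {{spark.kryoserializer.buffer.mb}} key takes a bare number of megabytes, while the newer key takes a size string with an explicit unit, which {{JavaUtils.parseByteString}} accepts. The 1000-megabyte figure simply mirrors the report.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Legacy key from the report: a bare number of megabytes.
//   spark.kryoserializer.buffer.mb = 1000
// Newer form: a size string with an explicit unit.
conf.set("spark.kryoserializer.buffer", "1000m")
{code}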
[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546617#comment-14546617 ] Tathagata Das commented on SPARK-7661: -- N+1 is used in the example, but isn't really the recommended way. Here is how it works. You have to give X + Y cores, where X = number of Kinesis streams/receivers and Y = number of cores for processing the data. The X receivers will, in collaboration with each other, receive data from the N shards. If you expect your N to vary from 10 to 20, then having X = 15 isn't a bad idea. At N = 20, the 15 receivers will distribute the work among themselves. And Y should be such that your system can process the data as fast as it is received. Support for dynamic allocation of executors in Kinesis Spark Streaming -- Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
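For illustration only (not from the ticket), a rough sketch of the pattern described in the comment: X receivers reading one Kinesis stream, with their DStreams unioned for processing. It assumes the 1.3.x {{KinesisUtils.createStream}} signature from the spark-streaming-kinesis-asl module and an existing SparkContext {{sc}}; the stream name, endpoint, and counts are placeholders.
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(sc, Seconds(10))

// X = 15 receivers; give the application X + Y cores, where Y covers processing.
val numReceivers = 15
val streams = (1 to numReceivers).map { _ =>
  KinesisUtils.createStream(ssc, "myKinesisStream",
    "https://kinesis.us-east-1.amazonaws.com", Seconds(10),
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}
// The receivers cooperatively cover however many shards the stream currently has.
val unioned = ssc.union(streams)
{code}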
[jira] [Commented] (SPARK-7671) Fix wrong URLs in MLlib Data Types Documentation
[ https://issues.apache.org/jira/browse/SPARK-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546622#comment-14546622 ] Favio Vázquez commented on SPARK-7671: -- Thanks [~josephkb] and [~srowen] for fixing that Fix wrong URLs in MLlib Data Types Documentation Key: SPARK-7671 URL: https://issues.apache.org/jira/browse/SPARK-7671 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS from cloudera 2.6.0-cdh5.4.0. Reporter: Favio Vázquez Assignee: Favio Vázquez Priority: Trivial Labels: Documentation,, Fix, MLlib,, URL Fix For: 1.4.0 There is a mistake in the URL of Matrices in the MLlib Data Types documentation (Local matrix scala section), the URL points to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices which is a mistake, since Matrices is an object that implements factory methods for Matrix that does not have a companion class. The correct link should point to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$ There is another mistake, in the Local Vector section in Scala, Java and Python In the Scala section the URL of Vectors points to the trait Vector (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and not to the factory methods implemented in Vectors. The correct link should be: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$ In the Java section the URL of Vectors points to the Interface Vector (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) and not to the Class Vectors The correct link should be: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html In the Python section the URL of Vectors points to the class Vector (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) and not the Class Vectors The correct link should be: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546623#comment-14546623 ] Murtaza Kanchwala commented on SPARK-7661: -- OK, let me try your solution as well with this scaling util. Support for dynamic allocation of executors in Kinesis Spark Streaming -- Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546632#comment-14546632 ] Apache Spark commented on SPARK-7654: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6210 DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
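For reference, a hedged sketch of the builder-style input/output API this issue proposes, using the DataFrameReader/DataFrameWriter names that shipped in 1.4; it assumes an existing SQLContext named {{sqlContext}}, and the paths are placeholders.
{code}
import org.apache.spark.sql.SaveMode

// Reading: sqlContext.read returns a DataFrameReader builder.
val people = sqlContext.read.format("json").load("/tmp/people.json")

// Writing: df.write returns a DataFrameWriter builder, replacing the many save(...) overloads.
people.write.format("parquet").mode(SaveMode.Overwrite).save("/tmp/people.parquet")
{code}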
[jira] [Updated] (SPARK-7646) Create table support to JDBC Datasource
[ https://issues.apache.org/jira/browse/SPARK-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7646: --- Labels: 1.4.1 (was: ) Create table support to JDBC Datasource --- Key: SPARK-7646 URL: https://issues.apache.org/jira/browse/SPARK-7646 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Venkata Ramana G Labels: 1.4.1 Support Create table for JDBCDataSource. A usage example: {code} df.saveAsTable( "testcreate2", "org.apache.spark.sql.jdbc", org.apache.spark.sql.SaveMode.Overwrite, Map("url" -> s"$url", "dbtable" -> "testcreate2", "user" -> "xx", "password" -> "xx", "driver" -> "com.h2.Driver") ) {code} If the table does not exist, this should create the table and write the DataFrame content to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Murtaza Kanchwala updated SPARK-7661: - Description: Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). was: Currently the logic for the no. of executors is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). Support for dynamic allocation of executors in Kinesis Spark Streaming -- Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7655) Akka timeout exception
[ https://issues.apache.org/jira/browse/SPARK-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7655. Resolution: Fixed Fix Version/s: 1.4.0 Akka timeout exception -- Key: SPARK-7655 URL: https://issues.apache.org/jira/browse/SPARK-7655 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Shixiong Zhu Priority: Blocker Fix For: 1.4.0 I got the following exception when I was running a query with broadcast join. {code} 15/05/15 01:15:49 [WARN] AkkaRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 10.0.171.162, 54870),broadcast_758_piece0,StorageLevel(false, false, false, false, 1),0,0,0)] in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.storage.BlockManagerMaster.updateBlockInfo(BlockManagerMaster.scala:58) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$tryToReportBlockStatus(BlockManager.scala:374) at org.apache.spark.storage.BlockManager.reportBlockStatus(BlockManager.scala:350) at org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1107) at org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1083) at org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1083) at scala.collection.immutable.Set$Set2.foreach(Set.scala:94) at org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1083) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:78) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7655) Akka timeout exception from ask and table broadcast
[ https://issues.apache.org/jira/browse/SPARK-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7655: --- Summary: Akka timeout exception from ask and table broadcast (was: Akka timeout exception) Akka timeout exception from ask and table broadcast --- Key: SPARK-7655 URL: https://issues.apache.org/jira/browse/SPARK-7655 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Shixiong Zhu Priority: Blocker Fix For: 1.4.0 I got the following exception when I was running a query with broadcast join. {code} 15/05/15 01:15:49 [WARN] AkkaRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 10.0.171.162, 54870),broadcast_758_piece0,StorageLevel(false, false, false, false, 1),0,0,0)] in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.storage.BlockManagerMaster.updateBlockInfo(BlockManagerMaster.scala:58) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$tryToReportBlockStatus(BlockManager.scala:374) at org.apache.spark.storage.BlockManager.reportBlockStatus(BlockManager.scala:350) at org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1107) at org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1083) at org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1083) at scala.collection.immutable.Set$Set2.foreach(Set.scala:94) at org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1083) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:78) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546647#comment-14546647 ] Apache Spark commented on SPARK-7654: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6211 DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546598#comment-14546598 ] Murtaza Kanchwala commented on SPARK-7661: -- OK, I'll correct my terms. My case is exactly like this one: https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3c30d8e3e3-95db-492b-8b49-73a99d587...@gmail.com%3E The difference is that the number of shards is updated by this utility provided by AWS, and when the number of shards increases, my Spark Streaming consumer gets hung up and goes into a waiting state. Support for dynamic allocation of executors in Kinesis Spark Streaming -- Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7671) Fix wrong URLs in MLlib Data Types Documentation
[ https://issues.apache.org/jira/browse/SPARK-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7671. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6196 [https://github.com/apache/spark/pull/6196] Fix wrong URLs in MLlib Data Types Documentation Key: SPARK-7671 URL: https://issues.apache.org/jira/browse/SPARK-7671 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS from cloudera 2.6.0-cdh5.4.0. Reporter: Favio Vázquez Priority: Trivial Labels: Documentation,, Fix, MLlib,, URL Fix For: 1.4.0 There is a mistake in the URL of Matrices in the MLlib Data Types documentation (Local matrix scala section), the URL points to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices which is a mistake, since Matrices is an object that implements factory methods for Matrix that does not have a companion class. The correct link should point to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$ There is another mistake, in the Local Vector section in Scala, Java and Python In the Scala section the URL of Vectors points to the trait Vector (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and not to the factory methods implemented in Vectors. The correct link should be: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$ In the Java section the URL of Vectors points to the Interface Vector (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) and not to the Class Vectors The correct link should be: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html In the Python section the URL of Vectors points to the class Vector (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) and not the Class Vectors The correct link should be: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7671) Fix wrong URLs in MLlib Data Types Documentation
[ https://issues.apache.org/jira/browse/SPARK-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7671: - Assignee: Favio Vázquez Fix wrong URLs in MLlib Data Types Documentation Key: SPARK-7671 URL: https://issues.apache.org/jira/browse/SPARK-7671 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS from cloudera 2.6.0-cdh5.4.0. Reporter: Favio Vázquez Assignee: Favio Vázquez Priority: Trivial Labels: Documentation,, Fix, MLlib,, URL Fix For: 1.4.0 There is a mistake in the URL of Matrices in the MLlib Data Types documentation (Local matrix scala section), the URL points to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices which is a mistake, since Matrices is an object that implements factory methods for Matrix that does not have a companion class. The correct link should point to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$ There is another mistake, in the Local Vector section in Scala, Java and Python In the Scala section the URL of Vectors points to the trait Vector (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and not to the factory methods implemented in Vectors. The correct link should be: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$ In the Java section the URL of Vectors points to the Interface Vector (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) and not to the Class Vectors The correct link should be: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html In the Python section the URL of Vectors points to the class Vector (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) and not the Class Vectors The correct link should be: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546628#comment-14546628 ] Sean Owen commented on SPARK-7670: -- I can't reproduce this. Master builds fine for me with the same commands. Failure when building with scala 2.11 (after 1.3.1 -- Key: SPARK-7670 URL: https://issues.apache.org/jira/browse/SPARK-7670 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Fernando Ruben Otero Fix For: 1.4.0 When trying to build spark with scala 2.11 on revision c64ff8036cc6bc7c87743f4c751d7fe91c2e366a (the one on master when I'm submitting this issue) I'm getting export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m dev/change-version-to-2.11.sh mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -Dhadoop.version=2.6.0 -DskipTests clean install ... ... ... [INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ spark-network-shuffle_2.11 --- /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() { return Type.UPLOAD_BLOCK; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37: error: not found: type Type protected Type type() { return Type.STREAM_HANDLE; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: error: not found: type Type protected Type type() { return Type.REGISTER_EXECUTOR; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: error: not found: type Type protected Type type() { return Type.OPEN_BLOCKS; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546637#comment-14546637 ] Reynold Xin commented on SPARK-7654: TODOs: - Move insertInto also into write. - Python API. - Update usage everywhere outside SQL. - Update programming guide. DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
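As a small illustration of the first TODO (an assumption about the final shape, based on the 1.4 DataFrameWriter API), insertInto becomes one more builder call on write; the table name and {{df}} below are placeholders.
{code}
// Appends the DataFrame's rows to an existing table through the writer builder.
df.write.insertInto("events_table")
{code}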
[jira] [Created] (SPARK-7682) Size of distributed grids still limited by cPickle
Toby Potter created SPARK-7682: -- Summary: Size of distributed grids still limited by cPickle Key: SPARK-7682 URL: https://issues.apache.org/jira/browse/SPARK-7682 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: Redhat Enterprise Linux 6.5, Spark 1.3.1 standalone in cluster mode, 2 nodes with 64 GB spark slaves, Python 2.7.6 Reporter: Toby Potter Priority: Minor I'm trying to explore the possibilities of writing a fault-tolerant distributed computing engine for multidimensional arrays. I'm finding that the Python cPickle serializer is limiting the size of Numpy arrays that I can distribute over the cluster. My example code is below #!/usr/bin/env python # Python app to use Spark from pyspark import SparkContext, SparkConf import numpy appName = "Spark Test App" # Create a spark context conf = SparkConf().setAppName(appName) # Set memory conf = conf.set("spark.executor.memory", "32g") sc = SparkContext(conf=conf) # Make array grid = numpy.zeros((1024, 1024, 1024)) # Now parallelise and persist the data rdd = sc.parallelize([("srcw", grid)]) # Make the data persist in memory rdd.persist() When I run the code I get the following error Traceback (most recent call last): File "test_app.py", line 20, in <module> rdd = sc.parallelize([("srcw", grid)]) File "/spark/1.3.1/python/pyspark/context.py", line 341, in parallelize serializer.dump_stream(c, tempFile) File "/spark/1.3.1/python/pyspark/serializers.py", line 208, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/spark/1.3.1/python/pyspark/serializers.py", line 127, in dump_stream self._write_with_length(obj, stream) File "/spark/1.3.1/python/pyspark/serializers.py", line 137, in _write_with_length serialized = self.dumps(obj) File "/spark/1.3.1/python/pyspark/serializers.py", line 403, in dumps return cPickle.dumps(obj, 2) SystemError: error return without exception set -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5948) Support writing to partitioned table for the Parquet data source
[ https://issues.apache.org/jira/browse/SPARK-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5948: - Assignee: Michael Armbrust Support writing to partitioned table for the Parquet data source Key: SPARK-5948 URL: https://issues.apache.org/jira/browse/SPARK-5948 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Lian Assignee: Michael Armbrust Priority: Blocker Fix For: 1.4.0 In 1.3.0, we added support for reading partitioned tables declared in Hive metastore for the Parquet data source. However, writing to partitioned tables is not supported yet. This feature should probably built upon SPARK-5947. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
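A hedged sketch of what writing to a partitioned table looks like from the user's side once this lands, using the 1.4 {{DataFrameWriter.partitionBy}} call; {{df}}, the column names, and paths are placeholders, and an existing {{sqlContext}} is assumed.
{code}
// Writes a Hive-style layout such as /data/events/year=2015/month=5/part-*.parquet
df.write.partitionBy("year", "month").parquet("/data/events")

// Partition discovery (SPARK-5947) turns the directory structure back into year/month columns on read.
val events = sqlContext.read.parquet("/data/events")
{code}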
[jira] [Updated] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5281: - Assignee: Iulian Dragos Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Reporter: sarsol Assignee: Iulian Dragos Priority: Critical Fix For: 1.4.0 Application crashes on this line {{rdd.registerTempTable(temp)}} in 1.2 version when using sbt or Eclipse SCALA IDE Stacktrace: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at 
scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5632: - Assignee: Wenchen Fan not able to resolve dot('.') in field name -- Key: SPARK-5632 URL: https://issues.apache.org/jira/browse/SPARK-5632 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.2.0, 1.3.0 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 Reporter: Lishu Liu Assignee: Wenchen Fan Priority: Blocker Fix For: 1.4.0 My cassandra table task_trace has a field sm.result which contains dot in the name. So SQL tried to look up sm instead of full name 'sm.result'. Here is my code: {code} scala import org.apache.spark.sql.cassandra.CassandraSQLContext scala val cc = new CassandraSQLContext(sc) scala val task_trace = cc.jsonFile(/task_trace.json) scala task_trace.registerTempTable(task_trace) scala cc.setKeyspace(cerberus_data_v4) scala val res = cc.sql(SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce') res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity {code} The full schema look like this: {code} scala task_trace.printSchema() root \|-- received_datetime: long (nullable = true) \|-- task_body: struct (nullable = true) \|\|-- cerberus_batch_id: string (nullable = true) \|\|-- cerberus_id: string (nullable = true) \|\|-- couponId: integer (nullable = true) \|\|-- coupon_code: string (nullable = true) \|\|-- created: string (nullable = true) \|\|-- description: string (nullable = true) \|\|-- domain: string (nullable = true) \|\|-- expires: string (nullable = true) \|\|-- message_id: string (nullable = true) \|\|-- neverShowAfter: string (nullable = true) \|\|-- neverShowBefore: string (nullable = true) \|\|-- offerTitle: string (nullable = true) \|\|-- screenshots: array (nullable = true) \|\|\|-- element: string (containsNull = false) \|\|-- sm.result: struct (nullable = true) \|\|\|-- cerberus_batch_id: string (nullable = true) \|\|\|-- cerberus_id: string (nullable = true) \|\|\|-- code: string (nullable = true) \|\|\|-- couponId: integer (nullable = true) \|\|\|-- created: string (nullable = true) \|\|\|-- description: string (nullable = true) \|\|\|-- domain: string (nullable = true) \|\|\|-- expires: string (nullable = true) \|\|\|-- message_id: string (nullable = true) \|\|\|-- neverShowAfter: string (nullable = true) \|\|\|-- neverShowBefore: string (nullable = true) \|\|\|-- offerTitle: string (nullable = true) \|\|\|-- result: struct (nullable = true) \|\|\|\|-- post: struct (nullable = true) \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: boolean (nullable = true) \|\|\|\|\|-- meta: struct (nullable = true) \|\|\|\|\|\|-- None_tx_value: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- exceptions: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- no_input_value: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) 
\|\|\|\|\|\|-- not_mapped: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- not_transformed: array (nullable = true) \|\|\|\|\|\|\|-- element: array (containsNull = false) \|\|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|-- now_price_checkout: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|\|-- shipping_price: struct (nullable = true) \|\|\|\|\|
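For the query in the SPARK-5632 report above, a hedged illustration of quoting the dotted field name with backticks so the analyzer does not split it on the dot (assuming the backtick-quoting syntax Spark SQL supports for such names, and reusing the reporter's {{cc}} context):
{code}
val res = cc.sql(
  """SELECT received_datetime, task_body.cerberus_id, task_body.`sm.result`
    |FROM task_trace
    |WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce'""".stripMargin)
{code}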
[jira] [Updated] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala
[ https://issues.apache.org/jira/browse/SPARK-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4699: - Assignee: Fei Wang Make caseSensitive configurable in Analyzer.scala - Key: SPARK-4699 URL: https://issues.apache.org/jira/browse/SPARK-4699 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Assignee: Fei Wang Fix For: 1.4.0 Currently, case sensitivity is true by default in Analyzer. It should be configurable by setting SQLConf in the client application -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5947) First class partitioning support in data sources API
[ https://issues.apache.org/jira/browse/SPARK-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5947: - Assignee: Michael Armbrust First class partitioning support in data sources API Key: SPARK-5947 URL: https://issues.apache.org/jira/browse/SPARK-5947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Lian Assignee: Michael Armbrust Priority: Blocker Fix For: 1.4.0 For file system based data sources, implementing Hive style partitioning support can be complex and error prone. To be specific, partitioning support include: # Partition discovery: Given a directory organized similar to Hive partitions, discover the directory structure and partitioning information automatically, including partition column names, data types, and values. # Reading from partitioned tables # Writing to partitioned tables It would be good to have first class partitioning support in the data sources API. For example, add a {{FileBasedScan}} trait with callbacks and default implementations for these features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6734) Support GenericUDTF.close for Generate
[ https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6734: - Assignee: Cheng Hao Support GenericUDTF.close for Generate -- Key: SPARK-6734 URL: https://issues.apache.org/jira/browse/SPARK-6734 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.4.0 Some third-party UDTF extension, will generate more rows in the GenericUDTF.close() method, which is supported by Hive. https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF However, Spark SQL ignores the GenericUDTF.close(), and it causes bug while porting job from Hive to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7109) Push down left side filter for left semi join
[ https://issues.apache.org/jira/browse/SPARK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7109: - Assignee: Fei Wang Push down left side filter for left semi join - Key: SPARK-7109 URL: https://issues.apache.org/jira/browse/SPARK-7109 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Assignee: Fei Wang Fix For: 1.4.0 now in spark sql optimizer we only push down right side filter, actually we can push down left side filter for left semi join -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
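A small illustration of the optimization described above, with hypothetical tables {{a(key, value)}} and {{b(key)}} and an existing {{sqlContext}}: the predicate touches only the left side, so it can be evaluated below the LEFT SEMI JOIN instead of on the join output.
{code}
// a.value > 10 references only the left relation, so the optimizer can push it beneath the join.
sqlContext.sql("SELECT * FROM a LEFT SEMI JOIN b ON a.key = b.key WHERE a.value > 10")
{code}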
[jira] [Updated] (SPARK-6439) Show per-task metrics when you hover over a task in the web UI visualization
[ https://issues.apache.org/jira/browse/SPARK-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6439: - Assignee: Kay Ousterhout Show per-task metrics when you hover over a task in the web UI visualization Key: SPARK-6439 URL: https://issues.apache.org/jira/browse/SPARK-6439 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6418) Add simple per-stage visualization to the UI
[ https://issues.apache.org/jira/browse/SPARK-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6418: - Assignee: Kousuke Saruta Add simple per-stage visualization to the UI Key: SPARK-6418 URL: https://issues.apache.org/jira/browse/SPARK-6418 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Kay Ousterhout Assignee: Kousuke Saruta Fix For: 1.4.0 Attachments: Screen Shot 2015-03-18 at 6.13.04 PM.png Visualizing how tasks in a stage spend their time can be very helpful to understanding performance. Many folks have started using the visualization tools here: https://github.com/kayousterhout/trace-analysis (see the README at the bottom) to analyze their jobs after they've finished running, but it would be great if this functionality were natively integrated into Spark's UI. I'd propose adding a relatively simple visualization to the stage detail page, that's hidden by default but that users can view by clicking on a drop-down menu. The plan is to implement this using D3; a mock up of how this would look (that uses D3) is attached. One change we'll make for the initial implementation, compared to the attached visualization, is tasks will be sorted by start time. This is intended to be a much simpler and more limited version of SPARK-3468 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7437) Fold literal in (item1, item2, ..., literal, ...) into true or false directly
[ https://issues.apache.org/jira/browse/SPARK-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7437: - Assignee: Zhongshuai Pei Fold literal in (item1, item2, ..., literal, ...) into true or false directly --- Key: SPARK-7437 URL: https://issues.apache.org/jira/browse/SPARK-7437 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Zhongshuai Pei Assignee: Zhongshuai Pei Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
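The description is empty, so the following is only a reading of the summary (an assumption): when the tested value and a matching list item are both literals, the IN predicate can be folded to a constant at optimization time rather than evaluated per row. Table {{t}} and its columns are hypothetical, and an existing {{sqlContext}} is assumed.
{code}
sqlContext.sql("SELECT * FROM t WHERE 1 IN (1, 2, key)")  // the literal 1 matches a literal item: folds to TRUE
sqlContext.sql("SELECT * FROM t WHERE 3 IN (1, 2)")       // no literal item can match: folds to FALSE
{code}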
[jira] [Updated] (SPARK-7504) NullPointerException when initializing SparkContext in YARN-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7504: - Assignee: Zoltán Zvara NullPointerException when initializing SparkContext in YARN-cluster mode Key: SPARK-7504 URL: https://issues.apache.org/jira/browse/SPARK-7504 Project: Spark Issue Type: Bug Components: Deploy, YARN Reporter: Zoltán Zvara Assignee: Zoltán Zvara Labels: deployment, yarn, yarn-client Fix For: 1.4.0 It is not clear for most users that, while running Spark on YARN a {{SparkContext}} with a given execution plan can be run locally as {{yarn-client}}, but can not deploy itself to the cluster. This is currently performed using {{org.apache.spark.deploy.yarn.Client}}. {color:gray} I think we should support deployment through {{SparkContext}}, but this is not the point I wish to make here. {color} Configuring a {{SparkContext}} to deploy itself currently will yield an {{ERROR}} while accessing {{spark.yarn.app.id}} in {{YarnClusterSchedulerBackend}}, and after that a {{NullPointerException}} while referencing the {{ApplicationMaster}} instance. Spark should clearly inform the user that it might be running in {{yarn-cluster}} mode without a proper submission using {{Client}} and that deploying is not supported directly from {{SparkContext}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7595) Window will cause resolve failed with self join
[ https://issues.apache.org/jira/browse/SPARK-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7595: - Assignee: Weizhong Window will cause resolve failed with self join --- Key: SPARK-7595 URL: https://issues.apache.org/jira/browse/SPARK-7595 Project: Spark Issue Type: Bug Components: SQL Reporter: Weizhong Assignee: Weizhong Priority: Minor Fix For: 1.4.0 for example: table: src(key string, value string) sql: with v1 as(select key, count(value) over (partition by key) cnt_val from src), v2 as(select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key) select * from v2 limit 5; then will analyze fail when resolving conflicting references in Join: 'Limit 5 'Project [*] 'Subquery v2 'Project ['v1.key,'v1_lag.cnt_val] 'Filter ('v1.key = 'v1_lag.key) 'Join Inner, None Subquery v1 Project [key#95,cnt_val#94L] Window [key#95,value#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96) WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING Project [key#95,value#96] MetastoreRelation default, src, None Subquery v1_lag Subquery v1 Project [key#97,cnt_val#94L] Window [key#97,value#98], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98) WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING Project [key#97,value#98] MetastoreRelation default, src, None Conflicting attributes: cnt_val#94L -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7598) Add aliveWorkers metrics in Master
[ https://issues.apache.org/jira/browse/SPARK-7598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7598: - Assignee: Rex Xiong Add aliveWorkers metrics in Master -- Key: SPARK-7598 URL: https://issues.apache.org/jira/browse/SPARK-7598 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.3.1 Reporter: Rex Xiong Assignee: Rex Xiong Priority: Minor Fix For: 1.4.0 In Spark Standalone setup, when some workers are DEAD, they will stay in master worker list for a while. master.workers metrics for master is only showing the total number of workers, we need to monitor how many real ALIVE workers are there to ensure the cluster is healthy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7601) Support Insert into JDBC Datasource
[ https://issues.apache.org/jira/browse/SPARK-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7601: - Assignee: Venkata Ramana G Support Insert into JDBC Datasource --- Key: SPARK-7601 URL: https://issues.apache.org/jira/browse/SPARK-7601 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Fix For: 1.4.0 Support Insert into JDBCDataSource. Following are usage examples {code} sqlContext.sql( s |CREATE TEMPORARY TABLE testram1 |USING org.apache.spark.sql.jdbc |OPTIONS (url '$url', dbtable 'testram1', user 'xx', password 'xx', driver 'com.h2.Driver') .stripMargin.replaceAll(\n, )) sqlContext.sql(insert into table testram1 select * from testsrc).show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
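A hedged reconstruction of the example above, assuming the quote characters of a standard interpolated multi-line Scala string were dropped in transit:
{code}
sqlContext.sql(
  s"""
     |CREATE TEMPORARY TABLE testram1
     |USING org.apache.spark.sql.jdbc
     |OPTIONS (url '$url', dbtable 'testram1', user 'xx', password 'xx', driver 'com.h2.Driver')
   """.stripMargin.replaceAll("\n", " "))

sqlContext.sql("insert into table testram1 select * from testsrc").show
{code}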
[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5782: - Priority: Major (was: Blocker) Python Worker / Pyspark Daemon Memory Issue --- Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.2.1, 1.2.2, 1.3.0 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns up twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. This can become problematic in the cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though is set to have a 512MB python worker memory limit and few gigs of executor memory. Another related issue to this is that the individual python workers are not supposed to even exceed that far beyond 512MB, otherwise they're supposed to spill to disk. Some of our python workers are somehow reaching 2GB each (which when multiplied by the number of cores per executor * the number of joins occurring in some cases), causing the Out-of-Memory killer to step up to its unfortunate job! :( I think with the _next_limit method in shuffle.py, if the current memory usage is close to the memory limit, then a 1.05 multiplier can endlessly cause more memory to be consumed by the single python worker, since the max of (512 vs 511 * 1.05) would end up blowing up towards the latter of the two... Shouldn't the memory limit be the absolute cap in this case? I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7269) Incorrect aggregation analysis
[ https://issues.apache.org/jira/browse/SPARK-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7269: - Priority: Major (was: Blocker) Incorrect aggregation analysis -- Key: SPARK-7269 URL: https://issues.apache.org/jira/browse/SPARK-7269 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao In a case insensitive analyzer (HiveContext), the attribute name captial differences will fail the analysis check for aggregation. {code} test(check analysis failed in case in-sensitive) { Seq(1,2,3).map(i = (i, i.toString)).toDF(key, value).registerTempTable(df_analysis) sql(SELECT kEy from df_analysis group by key) } {code} {noformat} expression 'kEy' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get.; org.apache.spark.sql.AnalysisException: expression 'kEy' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get.; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:39) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:85) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:101) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:89) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:39) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1121) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply$mcV$sp(SQLQuerySuite.scala:408) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) {noformat} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
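For reference, a minimal, self-contained sketch of the behavior the aggregation check should have (this is not Spark's actual CheckAnalysis code): with the case-insensitive resolver that HiveContext uses, the select-list name kEy should match the grouping column key. The Resolver type and the column names below are illustrative.
{code}
object CaseInsensitiveGroupByCheck {
  // A resolver decides whether two attribute names refer to the same column.
  type Resolver = (String, String) => Boolean
  val caseSensitive: Resolver   = (a: String, b: String) => a == b
  val caseInsensitive: Resolver = (a: String, b: String) => a.equalsIgnoreCase(b)

  /** Returns the select-list names that are neither grouped on nor aggregated. */
  def invalidAggregateRefs(selectNames: Seq[String],
                           groupByNames: Seq[String],
                           resolver: Resolver): Seq[String] =
    selectNames.filterNot(s => groupByNames.exists(g => resolver(s, g)))

  def main(args: Array[String]): Unit = {
    val select  = Seq("kEy")
    val groupBy = Seq("key")
    // With a case-sensitive resolver the query is (wrongly, for HiveContext) rejected:
    println(invalidAggregateRefs(select, groupBy, caseSensitive))   // List(kEy)
    // With the analyzer's case-insensitive resolver it passes:
    println(invalidAggregateRefs(select, groupBy, caseInsensitive)) // List()
  }
}
{code}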
[jira] [Updated] (SPARK-6680) Be able to specify IP for spark-shell (spark driver), blocker for Docker integration
[ https://issues.apache.org/jira/browse/SPARK-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6680: - Priority: Minor (was: Blocker) Be able to specify IP for spark-shell (spark driver), blocker for Docker integration --- Key: SPARK-6680 URL: https://issues.apache.org/jira/browse/SPARK-6680 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 1.3.0 Environment: Docker. Reporter: Egor Pakhomov Priority: Minor Labels: core, deploy, docker Suppose I have 3 Docker containers - spark_master, spark_worker and spark_shell. In Docker, the public IP of a container has an alias like fgsdfg454534 that is only visible inside that container. When Spark uses it for communication, other containers receive this alias and don't know what to do with it. That's why I used SPARK_LOCAL_IP for the master and worker. But it doesn't work for the Spark driver (for spark-shell - I haven't tried other types of drivers). The Spark driver advertises the fgsdfg454534 alias for itself, and then nobody can address it. I've worked around it in https://github.com/epahomov/docker-spark, but it would be better if this were solved at the Spark code level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
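A sketch of how the driver's advertised address can be pinned from the application side, using the spark.driver.host property (the address executors use to call back the driver). The master URL, container name and IP below are placeholders; this illustrates a workaround, not the code-level fix the reporter asks for.
{code}
import org.apache.spark.SparkConf

object DriverHostConfSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders: "spark_master" is the master container's name, "172.17.0.5" is an
    // address that the other containers can actually reach (not the in-container alias).
    val conf = new SparkConf()
      .setMaster("spark://spark_master:7077")
      .setAppName("docker-driver-host-example")
      .set("spark.driver.host", "172.17.0.5") // what executors use to reach the driver
    conf.getAll.foreach { case (k, v) => println(s"$k = $v") }
    // new SparkContext(conf) would then advertise 172.17.0.5 instead of the Docker alias.
  }
}
{code}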
[jira] [Updated] (SPARK-7119) ScriptTransform doesn't consider the output data type
[ https://issues.apache.org/jira/browse/SPARK-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7119: - Priority: Major (was: Blocker) ScriptTransform doesn't consider the output data type - Key: SPARK-7119 URL: https://issues.apache.org/jira/browse/SPARK-7119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Cheng Hao {panel} from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2; {panel} {panel} 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
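The ClassCastException suggests the script's stdout comes back as strings but is projected as if it already had the declared types. Below is a minimal sketch, deliberately independent of Spark internals, of converting the raw fields to the declared output schema before they feed expressions such as thing1 + 2; the ColumnType model and convertField helper are illustrative, not Spark classes.
{code}
object ScriptOutputConversion {
  sealed trait ColumnType
  case object IntType    extends ColumnType
  case object StringType extends ColumnType

  // Convert one raw field from the script's stdout into the declared type.
  def convertField(raw: String, dt: ColumnType): Any = dt match {
    case IntType    => raw.toInt  // without this, using "238" as an Int fails with a cast error
    case StringType => raw
  }

  def main(args: Array[String]): Unit = {
    val declared = Seq(IntType, StringType)          // as (thing1 int, thing2 string)
    val rawRow   = "238\tval_238".split("\t").toSeq  // what `cat` echoes back
    val row      = rawRow.zip(declared).map { case (f, t) => convertField(f, t) }
    val thing1   = row.head.asInstanceOf[Int]
    println(thing1 + 2)                              // 240
  }
}
{code}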
[jira] [Resolved] (SPARK-7523) ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
[ https://issues.apache.org/jira/browse/SPARK-7523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7523. -- Resolution: Invalid I think this should start as a discussion on the mailing list. It's not clear this is a Spark problem. ERROR LiveListenerBus: Listener EventLoggingListener threw an exception --- Key: SPARK-7523 URL: https://issues.apache.org/jira/browse/SPARK-7523 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Environment: Prod Reporter: sagar Priority: Blocker Attachments: schema.txt, spark-0.0.1-SNAPSHOT.jar Hi Team, I am using CDH 5.4 with spark 1.3.0. I am getting below error while executing below command - I see jira's (SPARK-2906/SPARK-1407) specifying the issue is resolved, but i didnt get any solution what the fix for that. Can you pls guide/suggest as this is production issue. $ spark-submit --master local[4] --class org.sample.spark.SparkFilter --name Spark Sample Program spark-0.0.1-SNAPSHOT.jar /user/user1/schema.txt == 15/05/11 06:28:36 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:169) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:36) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:792) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1998) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1959) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) ... 19 more == -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4452: - Priority: Major (was: Critical) Target Version/s: (was: 1.1.2, 1.2.1, 1.3.0) Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can hit a "too many open files" error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then, later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread requests memory, the old occupant can be evicted/spilled. Previously the spillable objects triggered spilling by themselves, so one might not trigger spilling even if another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object if it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
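A toy sketch of proposed changes 1 and 2 (this is not Spark's ShuffleMemoryManager): memory is tracked per spillable object rather than per thread only, and the manager itself may force an earlier occupant to spill when a new object in the same thread asks for memory. The Spillable trait, acquire method and fixed budget are illustrative assumptions.
{code}
import scala.collection.mutable

object PerObjectMemoryManagerSketch {
  trait Spillable { def spill(): Long }  // returns bytes released

  class MemoryManager(budget: Long) {
    private val held = mutable.LinkedHashMap.empty[Spillable, Long]
    private def used: Long = held.values.sum

    /** Grant up to `requested` bytes, spilling other holders if needed. */
    def acquire(owner: Spillable, requested: Long): Long = synchronized {
      var free = budget - used
      // Change 2: the manager itself can evict earlier occupants (e.g. the
      // ExternalAppendOnlyMap) instead of starving the newly created ExternalSorter.
      val others = held.keys.filter(_ ne owner).toList
      for (victim <- others if free < requested) {
        val released = victim.spill()
        held(victim) = math.max(0L, held(victim) - released)
        free += released
      }
      val granted = math.min(requested, free)
      if (granted > 0) held(owner) = held.getOrElse(owner, 0L) + granted  // change 1: per-object tracking
      granted
    }
  }

  def main(args: Array[String]): Unit = {
    val mm     = new MemoryManager(budget = 100)
    val map    = new Spillable { def spill(): Long = 80 }  // stand-in for ExternalAppendOnlyMap
    val sorter = new Spillable { def spill(): Long = 0  }  // stand-in for ExternalSorter
    println(mm.acquire(map, 90))     // 90: the map grabs most of the budget first
    println(mm.acquire(sorter, 50))  // 50: the map is forced to spill instead of the sorter starving
  }
}
{code}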
[jira] [Updated] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5205: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI -- Key: SPARK-5205 URL: https://issues.apache.org/jira/browse/SPARK-5205 Project: Spark Issue Type: Bug Components: Streaming Reporter: uncleGen The kill link is used to kill a stage in a job. It works for any kind of Spark job except Spark Streaming. To be specific, we can only kill the stage that is used to run the Receiver, but we cannot kill the Receivers themselves. The stage can be killed and cleaned from the UI, but the receivers are still alive and receiving data. I think this does not fit common sense. IMHO, killing the receiver stage should mean killing the receivers and stopping the reception of data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4888) Spark EC2 doesn't mount local disks for i2.8xlarge instances
[ https://issues.apache.org/jira/browse/SPARK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4888: - Target Version/s: 1.5.0 (was: 1.0.3, 1.1.2, 1.2.1, 1.3.0) Spark EC2 doesn't mount local disks for i2.8xlarge instances Key: SPARK-4888 URL: https://issues.apache.org/jira/browse/SPARK-4888 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Josh Rosen Priority: Critical When launching a cluster using {{spark-ec2}} with i2.8xlarge instances, the local disks aren't mounted. The AWS console doesn't show the disks as mounted, either. I think that the issue is that EC2 won't auto-mount the SSDs. We have some code that handles this for some of the {{r3*}} instance types, and I think the right fix is to extend this to the {{i2}} instance types, too: https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L37 /cc [~adav], who originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6174) Improve doc: Python ALS, MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6174: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Improve doc: Python ALS, MatrixFactorizationModel - Key: SPARK-6174 URL: https://issues.apache.org/jira/browse/SPARK-6174 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor The Python docs for recommendation have almost no content except an example. Add class, method attribute descriptions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4227) Document external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4227: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Document external shuffle service - Key: SPARK-4227 URL: https://issues.apache.org/jira/browse/SPARK-4227 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical We should add spark.shuffle.service.enabled to the Configuration page and give instructions for launching the shuffle service as an auxiliary service on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6266: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6173) Python doc parity with Scala/Java in MLlib
[ https://issues.apache.org/jira/browse/SPARK-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6173: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Python doc parity with Scala/Java in MLlib -- Key: SPARK-6173 URL: https://issues.apache.org/jira/browse/SPARK-6173 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This is an umbrella JIRA for noting parts of the Python API in MLlib which are significantly less well-documented than the Scala/Java docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6270) Standalone Master hangs when streaming job completes
[ https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6270: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Standalone Master hangs when streaming job completes Key: SPARK-6270 URL: https://issues.apache.org/jira/browse/SPARK-6270 Project: Spark Issue Type: Bug Components: Deploy, Streaming Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Tathagata Das Priority: Critical If the event logging is enabled, the Spark Standalone Master tries to recreate the web UI of a completed Spark application from its event logs. However if this event log is huge (e.g. for a Spark Streaming application), then the master hangs in its attempt to read and recreate the web ui. This hang causes the whole standalone cluster to be unusable. Workaround is to disable the event logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
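The workaround from the description, in configuration form; spark.eventLog.enabled is the property that controls event logging. A minimal sketch:
{code}
import org.apache.spark.SparkConf

object DisableEventLogSketch {
  def main(args: Array[String]): Unit = {
    // Workaround from the description: with event logging off, the standalone Master
    // never tries to rebuild a (possibly huge) streaming application's UI from its log.
    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.eventLog.enabled", "false")
    println(conf.get("spark.eventLog.enabled"))  // false
  }
}
{code}
The same setting can be passed on submission with --conf spark.eventLog.enabled=false.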
[jira] [Updated] (SPARK-6265) PySpark GLMs missing doc for intercept, weights
[ https://issues.apache.org/jira/browse/SPARK-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6265: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) PySpark GLMs missing doc for intercept, weights --- Key: SPARK-6265 URL: https://issues.apache.org/jira/browse/SPARK-6265 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark MLlib, the GLMs (e.g., LinearRegressionModel) have no documentation for the intercept and weights attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself
[ https://issues.apache.org/jira/browse/SPARK-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6632: - Fix Version/s: (was: 1.4.0) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself --- Key: SPARK-6632 URL: https://issues.apache.org/jira/browse/SPARK-6632 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Currently in ParquetRelation2, schema from all the part files is first merged, and then reconciled with metastore schema. This approach does not scale in case we have thousands of partitions for the table. We can take a different approach where we can go ahead with the metastore schema, and reconcile the names of the columns within each map task , using ReadSupport hooks provided in parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7563) OutputCommitCoordinator.stop() should only be executed in driver
[ https://issues.apache.org/jira/browse/SPARK-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7563: - Fix Version/s: (was: 1.4.0) OutputCommitCoordinator.stop() should only be executed in driver Key: SPARK-7563 URL: https://issues.apache.org/jira/browse/SPARK-7563 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo) Spark 1.3.1 Release Reporter: Hailong Wen Priority: Critical I am from IBM Platform Symphony team and we are integrating Spark 1.3.1 with EGO (a resource management product). In EGO we uses fine-grained dynamic allocation policy, and each Executor will exit after its tasks are all done. When testing *spark-shell*, we find that when executor of first job exit, it will stop OutputCommitCoordinator, which result in all future jobs failing. Details are as follows: We got the following error in executor when submitting job in *spark-shell* the second time (the first job submission is successful): {noformat} 15/05/11 04:02:31 INFO spark.util.AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@whlspark01:50452/user/OutputCommitCoordinator Exception in thread main akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@whlspark01:50452/), Path(/user/OutputCommitCoordinator)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89) at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {noformat} And in driver side, we see a 
log message telling that the OutputCommitCoordinator is stopped after the first submission: {noformat} 15/05/11 04:01:23 INFO spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped! {noformat} We examine the code of OutputCommitCoordinator, and find that executor will reuse the ref of driver's OutputCommitCoordinatorActor. So when an executor exits, it will eventually call SparkEnv.stop(): {noformat} private[spark] def stop() { isStopped = true pythonWorkers.foreach { case(key, worker) = worker.stop() } Option(httpFileServer).foreach(_.stop()) mapOutputTracker.stop() shuffleManager.stop() broadcastManager.stop() blockManager.stop() blockManager.master.stop() metricsSystem.stop() outputCommitCoordinator.stop() --- actorSystem.shutdown() .. {noformat}
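A minimal sketch of the direction the title implies: the SparkEnv-style shutdown should only stop the coordinator when it runs in the driver, so an exiting executor cannot tear down the driver-side actor that later jobs still need. The Coordinator object and isDriver flag below are illustrative stand-ins, not Spark's actual classes.
{code}
object OutputCommitCoordinatorSketch {
  // Shared, driver-owned resource that executors also hold a reference to.
  object Coordinator {
    @volatile var stopped = false
    def stop(): Unit = { stopped = true }
  }

  // Stand-in for SparkEnv.stop(): only the driver's env may stop the coordinator.
  def envStop(isDriver: Boolean): Unit = {
    if (isDriver) Coordinator.stop()
    // ... components that really are per-process would still be stopped here ...
  }

  def main(args: Array[String]): Unit = {
    envStop(isDriver = false)      // an executor finishing its tasks and exiting
    println(Coordinator.stopped)   // false: later jobs can still reach the coordinator
    envStop(isDriver = true)       // the driver shutting down
    println(Coordinator.stopped)   // true
  }
}
{code}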
[jira] [Updated] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6378: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) srcAttr in graph.triplets don't update when the size of graph is huge - Key: SPARK-6378 URL: https://issues.apache.org/jira/browse/SPARK-6378 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.1 Reporter: zhangzhenyue when the size of the graph is huge(0.2 billion vertex, 6 billion edges), the srcAttr and dstAttr in graph.triplets don't update when using the Graph.outerJoinVertices(when the data in vertex is changed). the code and the log is as follows: {quote} g = graph.outerJoinVertices()... g,vertices,count() g.edges.count() println(example edge + g.triplets.filter(e = e.srcId == 51L).collect() .map(e =(e.srcId + : + e.srcAttr + , + e.dstId + : + e.dstAttr)).mkString(\n)) println(example vertex + g.vertices.filter(e = e._1 == 51L).collect() .map(e = (e._1 + , + e._2)).mkString(\n)) {quote} the result: {quote} example edge 51:0, 2467451620:61 51:0, 1962741310:83 // attr of vertex 51 is 0 in Graph.triplets example vertex 51,2 // attr of vertex 51 is 2 in Graph.vertices {quote} when the graph is smaller(10 million vertex), the code is OK, the triplets will update when the vertex is changed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6701) Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application
[ https://issues.apache.org/jira/browse/SPARK-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6701: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application - Key: SPARK-6701 URL: https://issues.apache.org/jira/browse/SPARK-6701 Project: Spark Issue Type: Bug Components: Tests, YARN Affects Versions: 1.3.0 Reporter: Andrew Or Priority: Critical Observed in Master and 1.3, both in SBT and in Maven (with YARN). {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties, --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) exited with code 1 sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties, --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6981) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6981: - Fix Version/s: (was: 1.4.0) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext Key: SPARK-6981 URL: https://issues.apache.org/jira/browse/SPARK-6981 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Edoardo Vacchi Priority: Minor In order to simplify extensibility with new strategies from third-parties, it should be better to factor SparkPlanner and QueryExecution in their own classes. Dependent types add additional, unnecessary complexity; besides, HiveContext would benefit from this change as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly
[ https://issues.apache.org/jira/browse/SPARK-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6484: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Ganglia metrics xml reporter doesn't escape correctly - Key: SPARK-6484 URL: https://issues.apache.org/jira/browse/SPARK-6484 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Michael Armbrust Assignee: Josh Rosen Priority: Critical The following should be escaped: {code} &quot; ' &apos; &lt; &gt; &amp; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
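A small sketch of the kind of escaping being asked for, mapping the XML-special characters to entities before they are written into the metrics XML; the escapeXml helper is illustrative, not the Ganglia reporter's code.
{code}
object XmlEscape {
  def escapeXml(s: String): String =
    s.flatMap {
      case '&'  => "&amp;"
      case '<'  => "&lt;"
      case '>'  => "&gt;"
      case '"'  => "&quot;"
      case '\'' => "&apos;"
      case c    => c.toString
    }

  def main(args: Array[String]): Unit = {
    println(escapeXml("a<b & c>\"d\" & 'e'"))
    // a&lt;b &amp; c&gt;&quot;d&quot; &amp; &apos;e&apos;
  }
}
{code}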
[jira] [Updated] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag
[ https://issues.apache.org/jira/browse/SPARK-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7606: - Fix Version/s: (was: 1.4.0) Document all PySpark SQL/DataFrame public methods with @since tag - Key: SPARK-7606 URL: https://issues.apache.org/jira/browse/SPARK-7606 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Nicholas Chammas -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7444) Eliminate noisy css warn/error logs for UISeleniumSuite
[ https://issues.apache.org/jira/browse/SPARK-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7444: - Fix Version/s: (was: 1.4.0) Eliminate noisy css warn/error logs for UISeleniumSuite --- Key: SPARK-7444 URL: https://issues.apache.org/jira/browse/SPARK-7444 Project: Spark Issue Type: Improvement Components: Tests Reporter: Shixiong Zhu Priority: Minor Eliminate the following noisy logs for {{UISeleniumSuite}}: {code} 15/05/07 10:09:50.196 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS error: 'http://192.168.0.170:4040/static/bootstrap.min.css' [793:167] Error in style rule. (Invalid token *. Was expecting one of: EOF, S, IDENT, }, ;.) 15/05/07 10:09:50.196 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS warning: 'http://192.168.0.170:4040/static/bootstrap.min.css' [793:167] Ignoring the following declarations in this rule. 15/05/07 10:09:50.197 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS error: 'http://192.168.0.170:4040/static/bootstrap.min.css' [799:325] Error in style rule. (Invalid token *. Was expecting one of: EOF, S, IDENT, }, ;.) 15/05/07 10:09:50.197 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS warning: 'http://192.168.0.170:4040/static/bootstrap.min.css' [799:325] Ignoring the following declarations in this rule. 15/05/07 10:09:50.198 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS error: 'http://192.168.0.170:4040/static/bootstrap.min.css' [805:18] Error in style rule. (Invalid token *. Was expecting one of: EOF, S, IDENT, }, ;.) 15/05/07 10:09:50.198 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS warning: 'http://192.168.0.170:4040/static/bootstrap.min.css' [805:18] Ignoring the following declarations in this rule. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold
[ https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7097: - Fix Version/s: (was: 1.4.0) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold --- Key: SPARK-7097 URL: https://issues.apache.org/jira/browse/SPARK-7097 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1 Reporter: Yash Datta Currently, when deciding whether to create a broadcast hash join or a shuffled hash join, the size estimation of the partitioned tables involved considers the size of the entire table. This results in many query plans using shuffled hash joins where, in fact, only a small number of partitions are referenced by the actual query (due to additional filters), and which could therefore be run using a broadcast hash join. The query plan should consider the size of only the referenced partitions in such cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
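A toy sketch of the proposed estimation: sum only the partitions that survive the query's partition filters and compare that to the broadcast threshold, instead of using the whole table's size. The Partition model and the 10 MB threshold below are illustrative assumptions, not Spark SQL code.
{code}
object PrunedSizeEstimation {
  case class Partition(values: Map[String, String], sizeInBytes: Long)

  def estimatedSize(partitions: Seq[Partition],
                    partitionFilter: Map[String, String] => Boolean): Long =
    partitions.filter(p => partitionFilter(p.values)).map(_.sizeInBytes).sum

  def main(args: Array[String]): Unit = {
    val autoBroadcastJoinThreshold = 10L * 1024 * 1024   // 10 MB, the usual default
    val table = (1 to 1000).map(d => Partition(Map("day" -> d.toString), 5L * 1024 * 1024))

    val wholeTable = table.map(_.sizeInBytes).sum
    val referenced = estimatedSize(table, _.get("day").contains("7"))  // e.g. WHERE day = '7'

    println(wholeTable <= autoBroadcastJoinThreshold)  // false: plan falls back to a shuffled join
    println(referenced <= autoBroadcastJoinThreshold)  // true: a broadcast join is possible
  }
}
{code}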
[jira] [Updated] (SPARK-6828) Spark returns misleading message when client is incompatible with server
[ https://issues.apache.org/jira/browse/SPARK-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6828: - Fix Version/s: (was: 1.4.0) Spark returns misleading message when client is incompatible with server Key: SPARK-6828 URL: https://issues.apache.org/jira/browse/SPARK-6828 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: Client: Windows 7 spark-core v.1.3.0 Server: RedHat 6.6 spark-core v.1.2.0 Reporter: Alexander Ulanov Priority: Minor Client code: val conf = new SparkConf(). setMaster(spark://mynetwrok.com:7077). setAppName(myapp). val sc = new SparkContext(conf) Server reply: 5/04/09 15:35:22 INFO client.AppClient$ClientActor: Connecting to master akka.tcp://sparkmas...@mynetwrok.com:7077/user/Master... 15/04/09 15:35:22 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@mynetwork:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7527) Wrong detection of REPL mode in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7527: - Fix Version/s: (was: 1.4.0) Wrong detection of REPL mode in ClosureCleaner -- Key: SPARK-7527 URL: https://issues.apache.org/jira/browse/SPARK-7527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Oleksii Kostyliev Priority: Minor If REPL class is not present on the classpath, the {{inIntetpreter}} boolean switch shall be {{false}}, not {{true}} at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L247 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
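A sketch of the corrected detection described above: try to load the REPL entry point and treat a missing class as "not in the interpreter", rather than defaulting to true. The wrapper object is illustrative, and the class name is used here only to show the lookup.
{code}
object ReplDetectionSketch {
  /** True only when the Spark REPL classes are actually on the classpath. */
  def inInterpreter: Boolean =
    try {
      Class.forName("org.apache.spark.repl.Main")
      true
    } catch {
      // REPL jar absent (e.g. a plain application): this must yield false, not true.
      case _: ClassNotFoundException => false
    }

  def main(args: Array[String]): Unit =
    println(s"running in REPL: $inInterpreter")
}
{code}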
[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6803: - Fix Version/s: (was: 1.4.0) [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7316) Add step capability to RDD sliding window
[ https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7316: - Fix Version/s: (was: 1.4.0) Add step capability to RDD sliding window - Key: SPARK-7316 URL: https://issues.apache.org/jira/browse/SPARK-7316 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Alexander Ulanov Original Estimate: 24h Remaining Estimate: 24h RDDFunctions in MLlib contains a sliding window implementation with step 1. The user should be able to define the step, so this capability should be implemented. Although one can generate sliding windows with step 1 and then keep only every Nth window, that may take much more time and disk space depending on the step size. For example, if your window is 1000, you will generate an amount of data a thousand times bigger than your initial dataset. That does not make sense if you only need every Nth window, in which case the data generated would be 1000/N times smaller. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
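The step-1-then-filter workaround mentioned above, written out with MLlib's RDDFunctions.sliding; a native sliding(window, step) would avoid materializing the dropped windows in the first place. Assumes spark-mllib on the classpath; the local master is for the demo only.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.RDDFunctions._

object SlidingWithStep {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sliding-step"))
    val data = sc.parallelize(1 to 10)

    val window = 3
    val step   = 2
    // Workaround: build every window (step 1), then keep every `step`-th one.
    val windows = data.sliding(window)
      .zipWithIndex()
      .filter { case (_, i) => i % step == 0 }
      .map(_._1.toList)

    windows.collect().foreach(println)  // List(1, 2, 3), List(3, 4, 5), List(5, 6, 7), ...
    sc.stop()
  }
}
{code}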
[jira] [Updated] (SPARK-6828) Spark returns misleading message when client is incompatible with server
[ https://issues.apache.org/jira/browse/SPARK-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6828: - Target Version/s: (was: 1.4.0) Spark returns misleading message when client is incompatible with server Key: SPARK-6828 URL: https://issues.apache.org/jira/browse/SPARK-6828 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: Client: Windows 7 spark-core v.1.3.0 Server: RedHat 6.6 spark-core v.1.2.0 Reporter: Alexander Ulanov Priority: Minor Client code: val conf = new SparkConf(). setMaster(spark://mynetwrok.com:7077). setAppName(myapp). val sc = new SparkContext(conf) Server reply: 5/04/09 15:35:22 INFO client.AppClient$ClientActor: Connecting to master akka.tcp://sparkmas...@mynetwrok.com:7077/user/Master... 15/04/09 15:35:22 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@mynetwork:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7670) Failure when building with scala 2.11 (after 1.3.1
[ https://issues.apache.org/jira/browse/SPARK-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7670: - Fix Version/s: (was: 1.4.0) Failure when building with scala 2.11 (after 1.3.1 -- Key: SPARK-7670 URL: https://issues.apache.org/jira/browse/SPARK-7670 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Fernando Ruben Otero When trying to build spark with scala 2.11 on revision c64ff8036cc6bc7c87743f4c751d7fe91c2e366a (the one on master when I'm submitting this issue) I'm getting export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m dev/change-version-to-2.11.sh mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -Dhadoop.version=2.6.0 -DskipTests clean install ... ... ... [INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ spark-network-shuffle_2.11 --- /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() { return Type.UPLOAD_BLOCK; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37: error: not found: type Type protected Type type() { return Type.STREAM_HANDLE; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: error: not found: type Type protected Type type() { return Type.REGISTER_EXECUTOR; } ^ /Users/ZeoS/dev/bigdata/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: error: not found: type Type protected Type type() { return Type.OPEN_BLOCKS; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7627) DAG visualization: cached RDDs not shown on job page
[ https://issues.apache.org/jira/browse/SPARK-7627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7627: - Fix Version/s: (was: 1.4.0) DAG visualization: cached RDDs not shown on job page Key: SPARK-7627 URL: https://issues.apache.org/jira/browse/SPARK-7627 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or It's a small styling issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7658) Update the mouse behaviors for the timeline graphs
[ https://issues.apache.org/jira/browse/SPARK-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7658: - Fix Version/s: (was: 1.4.0) Update the mouse behaviors for the timeline graphs -- Key: SPARK-7658 URL: https://issues.apache.org/jira/browse/SPARK-7658 Project: Spark Issue Type: Improvement Components: Streaming, Web UI Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7224) Mock repositories for testing with --packages
[ https://issues.apache.org/jira/browse/SPARK-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7224: - Fix Version/s: (was: 1.4.0) Mock repositories for testing with --packages - Key: SPARK-7224 URL: https://issues.apache.org/jira/browse/SPARK-7224 Project: Spark Issue Type: Test Components: Spark Submit Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Critical Create mock repositories (folders with jars and POMs in Maven layout) for testing --packages without the need for an internet connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6197: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) handle json parse exception for eventlog file not finished writing --- Key: SPARK-6197 URL: https://issues.apache.org/jira/browse/SPARK-6197 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Labels: backport-needed This is a follow-up JIRA to [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can display event log files that have the suffix *.inprogress*. However, the event log file may not have finished being written in some abnormal cases (e.g. Ctrl+C), in which case the file may be truncated on its last line, leaving that line in invalid JSON format, which will cause a JSON parse exception when reading the file. In this case, we can simply ignore the content of the last line, since the history shown on the web for abnormal cases is only a reference for the user: it demonstrates the past status of the app before it terminated abnormally (we cannot guarantee that the history shows exactly the last moment when the app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
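A sketch of the proposed handling, independent of Spark's replay code: parse the log line by line and swallow a parse failure only when it occurs on the final line (the truncated tail of an .inprogress file). The tiny parseEvent stand-in is illustrative, not the real JSON event parser.
{code}
import scala.util.{Failure, Success, Try}

object TolerantEventLogReplay {
  // Illustrative stand-in for the real JSON event parser: fails on a truncated line.
  def parseEvent(line: String): String = {
    require(line.trim.startsWith("{") && line.trim.endsWith("}"), s"malformed event: $line")
    line.trim
  }

  /** Replay all events, ignoring a parse failure only on the last line. */
  def replay(lines: Seq[String]): Seq[String] =
    lines.zipWithIndex.flatMap { case (line, i) =>
      Try(parseEvent(line)) match {
        case Success(event) => Some(event)
        case Failure(_) if i == lines.size - 1 =>
          None                       // truncated tail of an .inprogress log: skip it
        case Failure(e) => throw e   // corruption elsewhere is still an error
      }
    }

  def main(args: Array[String]): Unit = {
    val log = Seq("""{"Event":"SparkListenerJobStart"}""",
                  """{"Event":"SparkListenerJobEnd"}""",
                  """{"Event":"SparkListen""")            // killed mid-write (e.g. Ctrl+C)
    replay(log).foreach(println)                          // replays the two complete events
  }
}
{code}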
[jira] [Updated] (SPARK-7245) Spearman correlation for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7245: - Fix Version/s: (was: 1.4.0) Spearman correlation for DataFrames --- Key: SPARK-7245 URL: https://issues.apache.org/jira/browse/SPARK-7245 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Xiangrui Meng Spearman correlation is harder than Pearson to compute. ~~~ df.stat.corr(col1, col2, method=spearman): Double ~~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
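Spearman is Pearson computed on ranks, and MLlib already implements it for pairs of RDD[Double] via Statistics.corr, so a df.stat.corr(col1, col2, "spearman") could delegate to it. A sketch of that delegation, assuming two numeric columns; the SpearmanSketch wrapper itself is illustrative.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.{DataFrame, SQLContext}

object SpearmanSketch {
  // Sketch of what df.stat.corr(col1, col2, "spearman") could delegate to.
  def spearman(df: DataFrame, col1: String, col2: String): Double = {
    val x = df.select(col1).map(_.getDouble(0))
    val y = df.select(col2).map(_.getDouble(0))
    Statistics.corr(x, y, "spearman")   // MLlib handles the ranking
  }

  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("spearman"))
    val ctx = new SQLContext(sc)
    import ctx.implicits._
    val df = sc.parallelize(Seq((1.0, 10.0), (2.0, 40.0), (3.0, 90.0))).toDF("a", "b")
    println(spearman(df, "a", "b"))  // 1.0: monotone relationship, even though it is not linear
    sc.stop()
  }
}
{code}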
[jira] [Updated] (SPARK-7498) Params.setDefault should not use varargs annotation
[ https://issues.apache.org/jira/browse/SPARK-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7498: - Fix Version/s: (was: 1.4.0) Params.setDefault should not use varargs annotation --- Key: SPARK-7498 URL: https://issues.apache.org/jira/browse/SPARK-7498 Project: Spark Issue Type: Bug Components: Java API, ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng In [SPARK-7429] and PR [https://github.com/apache/spark/pull/5960], I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me. However, @mengxr reported issues compiling on his machine. So I'm reverting the change introduced in [https://github.com/apache/spark/pull/5960] by removing varargs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6216) Check Python version in worker before run PySpark job
[ https://issues.apache.org/jira/browse/SPARK-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6216: - Fix Version/s: (was: 1.4.0) Check Python version in worker before run PySpark job - Key: SPARK-6216 URL: https://issues.apache.org/jira/browse/SPARK-6216 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu PySpark can only run with the same major version both in driver and worker ( both of the are 2.6 or 2.7), it will cause random error if it have 2.7 in driver or 2.6 in worker (or vice). For example: {code} davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) SparkContext available as sc, SQLContext available as sqlCtx. sc.textFile('LICENSE').map(lambda l: l.split()).count() org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/davies/work/spark/python/pyspark/worker.py, line 101, in main process() File /Users/davies/work/spark/python/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 281, in func return f(iterator) File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File stdin, line 1, in lambda TypeError: 'bool' object is not callable at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:177) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7287) Flaky test: o.a.s.deploy.SparkSubmitSuite --packages
[ https://issues.apache.org/jira/browse/SPARK-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7287: - Fix Version/s: (was: 1.4.0) Flaky test: o.a.s.deploy.SparkSubmitSuite --packages Key: SPARK-7287 URL: https://issues.apache.org/jira/browse/SPARK-7287 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Burak Yavuz Priority: Critical Labels: flaky-test Error message was not helpful (did not complete within 60 seconds or something). Observed only in master: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/2239/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/2238/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2163/ ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6197: - Fix Version/s: 1.4.0 handle json parse exception for eventlog file not finished writing --- Key: SPARK-6197 URL: https://issues.apache.org/jira/browse/SPARK-6197 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Labels: backport-needed Fix For: 1.4.0 This is a follow-up JIRA to [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can display event log files that have the suffix *.inprogress*. However, the event log file may not have finished being written in some abnormal cases (e.g. Ctrl+C), in which case the file may be truncated on its last line, leaving that line in invalid JSON format, which will cause a JSON parse exception when reading the file. In this case, we can simply ignore the content of the last line, since the history shown on the web for abnormal cases is only a reference for the user: it demonstrates the past status of the app before it terminated abnormally (we cannot guarantee that the history shows exactly the last moment when the app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6197: - Fix Version/s: (was: 1.4.0) handle json parse exception for eventlog file not finished writing --- Key: SPARK-6197 URL: https://issues.apache.org/jira/browse/SPARK-6197 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Labels: backport-needed This is a follow-up JIRA to [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can display event log files that have the suffix *.inprogress*. However, the event log file may not have finished being written in some abnormal cases (e.g. Ctrl+C), in which case the file may be truncated on its last line, leaving that line in invalid JSON format, which will cause a JSON parse exception when reading the file. In this case, we can simply ignore the content of the last line, since the history shown on the web for abnormal cases is only a reference for the user: it demonstrates the past status of the app before it terminated abnormally (we cannot guarantee that the history shows exactly the last moment when the app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2155) Support effectful / non-deterministic key expressions in CASE WHEN statements
[ https://issues.apache.org/jira/browse/SPARK-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2155: - Assignee: Wenchen Fan Support effectful / non-deterministic key expressions in CASE WHEN statements - Key: SPARK-2155 URL: https://issues.apache.org/jira/browse/SPARK-2155 Project: Spark Issue Type: Bug Components: SQL Reporter: Zongheng Yang Assignee: Wenchen Fan Priority: Minor Fix For: 1.4.0 Currently we translate CASE KEY WHEN to CASE WHEN, hence incurring redundant evaluations of the key expression. Relevant discussions here: https://github.com/apache/spark/pull/1055/files#r13784248 If we are very in need of support for effectful key expressions, at least we can resort to the baseline approach of having both CaseWhen and CaseKeyWhen as expressions, which seem to introduce much code duplication (e.g. see https://github.com/concretevitamin/spark/blob/47d406a58d129e5bba68bfadf9dd1faa9054d834/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L216 for a sketch implementation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
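A toy illustration of why the rewrite matters for effectful keys: translating CASE key WHEN v THEN r ... into CASE WHEN key = v ... re-evaluates the key expression once per branch tried, while a dedicated CASE KEY WHEN evaluation computes it exactly once. The mini model below is illustrative, not Catalyst code.
{code}
object CaseKeyWhenSketch {
  var evaluations = 0
  // A side-effecting / non-deterministic stand-in for the key expression.
  def key(): Int = { evaluations += 1; 2 }

  // CASE key WHEN ... rewritten as CASE WHEN key = ... : the key is re-evaluated per branch.
  def caseWhenRewrite(branches: Seq[(Int, String)], default: String): String = {
    for ((v, r) <- branches) if (key() == v) return r
    default
  }

  // A dedicated CASE KEY WHEN evaluation: the key is computed exactly once.
  def caseKeyWhen(branches: Seq[(Int, String)], default: String): String = {
    val k = key()
    for ((v, r) <- branches) if (k == v) return r
    default
  }

  def main(args: Array[String]): Unit = {
    val branches = Seq(1 -> "one", 2 -> "two", 3 -> "three")
    evaluations = 0
    println(caseWhenRewrite(branches, "other") + " after " + evaluations + " key evaluation(s)") // two after 2
    evaluations = 0
    println(caseKeyWhen(branches, "other") + " after " + evaluations + " key evaluation(s)")     // two after 1
  }
}
{code}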
[jira] [Updated] (SPARK-7093) Using newPredicate in NestedLoopJoin to enable code generation
[ https://issues.apache.org/jira/browse/SPARK-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7093: - Assignee: Fei Wang Using newPredicate in NestedLoopJoin to enable code generation -- Key: SPARK-7093 URL: https://issues.apache.org/jira/browse/SPARK-7093 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Fei Wang Assignee: Fei Wang Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7123) support table.star in sqlcontext
[ https://issues.apache.org/jira/browse/SPARK-7123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7123: - Assignee: Fei Wang support table.star in sqlcontext Key: SPARK-7123 URL: https://issues.apache.org/jira/browse/SPARK-7123 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Assignee: Fei Wang Fix For: 1.4.0 support this sql SELECT r.* FROM testData l join testData2 r on (l.key = r.a) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org