[jira] [Commented] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637044#comment-14637044 ]

Apache Spark commented on SPARK-9254: User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7597

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Assigned] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9254: Assignee: Cheng Lian (was: Apache Spark)

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Assigned] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9254: Assignee: Apache Spark (was: Cheng Lian)

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Resolved] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-9254.

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7597
[https://github.com/apache/spark/pull/7597]

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Fix For: 1.5.0

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Updated] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-9254: Target Version/s: 1.4.2, 1.5.0 (was: 1.5.0)

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Labels: backport-needed
Fix For: 1.5.0

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Commented] (SPARK-9253) Allow to create machines with different AWS credentials than will be used for accessing the S3
[ https://issues.apache.org/jira/browse/SPARK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637043#comment-14637043 ]

Apache Spark commented on SPARK-9253: User 'ziky90' has created a pull request for this issue: https://github.com/apache/spark/pull/7596

Allow to create machines with different AWS credentials than will be used for accessing the S3

Key: SPARK-9253
URL: https://issues.apache.org/jira/browse/SPARK-9253
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 1.4.1
Reporter: Jan Zikeš

Currently, when you would like to use the `spark_ec2.py` script together with S3, your only option is to use exactly the same AWS credentials for both EC2 machine creation and S3 access. For security reasons I would very much appreciate being able to access S3 with credentials different from those with which I launch the machines. The proposed solution is to add an option to `spark_ec2.py` for passing additional credentials.
[jira] [Assigned] (SPARK-9253) Allow to create machines with different AWS credentials than will be used for accessing the S3
[ https://issues.apache.org/jira/browse/SPARK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9253: Assignee: (was: Apache Spark)

Allow to create machines with different AWS credentials than will be used for accessing the S3

Key: SPARK-9253
URL: https://issues.apache.org/jira/browse/SPARK-9253
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 1.4.1
Reporter: Jan Zikeš

Currently, when you would like to use the `spark_ec2.py` script together with S3, your only option is to use exactly the same AWS credentials for both EC2 machine creation and S3 access. For security reasons I would very much appreciate being able to access S3 with credentials different from those with which I launch the machines. The proposed solution is to add an option to `spark_ec2.py` for passing additional credentials.
[jira] [Created] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
Cheng Lian created SPARK-9254:

Summary: sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Updated] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-9254: Labels: backport-needed (was: )

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Labels: backport-needed
Fix For: 1.5.0

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Reopened] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection
[ https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian reopened SPARK-9254: Reopening this since we need to backport this fix to branch-1.4.

sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

Key: SPARK-9254
URL: https://issues.apache.org/jira/browse/SPARK-9254
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Labels: backport-needed
Fix For: 1.5.0

The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.
[jira] [Assigned] (SPARK-9253) Allow to create machines with different AWS credentials than will be used for accessing the S3
[ https://issues.apache.org/jira/browse/SPARK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9253: Assignee: Apache Spark

Allow to create machines with different AWS credentials than will be used for accessing the S3

Key: SPARK-9253
URL: https://issues.apache.org/jira/browse/SPARK-9253
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 1.4.1
Reporter: Jan Zikeš
Assignee: Apache Spark

Currently, when you would like to use the `spark_ec2.py` script together with S3, your only option is to use exactly the same AWS credentials for both EC2 machine creation and S3 access. For security reasons I would very much appreciate being able to access S3 with credentials different from those with which I launch the machines. The proposed solution is to add an option to `spark_ec2.py` for passing additional credentials.
[jira] [Updated] (SPARK-9192) add initialization phase for nondeterministic expression
[ https://issues.apache.org/jira/browse/SPARK-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-9192: Summary: add initialization phase for nondeterministic expression (was: add initialization phase for expression)

add initialization phase for nondeterministic expression

Key: SPARK-9192
URL: https://issues.apache.org/jira/browse/SPARK-9192
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Wenchen Fan

Some expressions have mutable state and need to be initialized first (like Rand, WeekOfYear). Currently we use `@transient lazy val` to make the state get initialized automatically when it is first used, and reset after serialization and deserialization. However, this approach is kind of ugly, and accessing a lazy val is not efficient; we should have an explicit initialization phase for expressions.
[jira] [Updated] (SPARK-8364) Add crosstab to SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8364: Shepherd: Shivaram Venkataraman

Add crosstab to SparkR DataFrames

Key: SPARK-8364
URL: https://issues.apache.org/jira/browse/SPARK-8364
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

Add `crosstab` to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to `table` in R. However, `table` in SparkR is used for loading SQL tables as DataFrames. The return type is data.frame instead of table so that `crosstab` stays compatible with Scala/Python.
[jira] [Updated] (SPARK-9230) SparkR RFormula should support StringType features
[ https://issues.apache.org/jira/browse/SPARK-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9230: Target Version/s: 1.5.0

SparkR RFormula should support StringType features

Key: SPARK-9230
URL: https://issues.apache.org/jira/browse/SPARK-9230
Project: Spark
Issue Type: New Feature
Components: ML, SparkR
Reporter: Eric Liang
Assignee: Eric Liang

StringType features will need to be encoded using OneHotEncoder to be used for regression. See the umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing
[jira] [Updated] (SPARK-9230) SparkR RFormula should support StringType features
[ https://issues.apache.org/jira/browse/SPARK-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9230: Assignee: Eric Liang

SparkR RFormula should support StringType features

Key: SPARK-9230
URL: https://issues.apache.org/jira/browse/SPARK-9230
Project: Spark
Issue Type: New Feature
Components: ML, SparkR
Reporter: Eric Liang
Assignee: Eric Liang

StringType features will need to be encoded using OneHotEncoder to be used for regression. See the umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing
[jira] [Updated] (SPARK-9192) add initialization phase for nondeterministic expression
[ https://issues.apache.org/jira/browse/SPARK-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-9192: Description:

Currently nondeterministic expressions are broken without an explicit initialization phase. Take `MonotonicallyIncreasingID` as an example: this expression needs mutable state to remember how many times it has been evaluated, so we use `@transient var count: Long` there. Because it is transient, `count` is reset to 0, and **only** to 0, on serialization and deserialization, as deserializing a transient variable yields its default value. There is *no way* to use a different initial value for `count` until we add an explicit initialization phase. For now no nondeterministic expression needs this feature, but we may add new ones that need a different initial value for their mutable state in the future.

was: Some expressions have mutable state and need to be initialized first (like Rand, WeekOfYear). Currently we use `@transient lazy val` to make the state get initialized automatically when it is first used, and reset after serialization and deserialization. However, this approach is kind of ugly, and accessing a lazy val is not efficient; we should have an explicit initialization phase for expressions.

add initialization phase for nondeterministic expression

Key: SPARK-9192
URL: https://issues.apache.org/jira/browse/SPARK-9192
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Wenchen Fan

Currently nondeterministic expressions are broken without an explicit initialization phase. Take `MonotonicallyIncreasingID` as an example: this expression needs mutable state to remember how many times it has been evaluated, so we use `@transient var count: Long` there. Because it is transient, `count` is reset to 0, and **only** to 0, on serialization and deserialization, as deserializing a transient variable yields its default value. There is *no way* to use a different initial value for `count` until we add an explicit initialization phase. There is no nondeterministic expression that needs this feature today, but we may add new ones that need a different initial value for their mutable state in the future.
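To make the limitation concrete, here is a minimal self-contained Scala sketch of what an explicit initialization phase could look like. The trait and class names are hypothetical illustrations for this digest, not Spark's actual API:

{code}
// Hypothetical sketch, not Spark's API: an explicit initialize() call lets an
// expression pick any initial value for its mutable state, whereas @transient
// alone always yields the JVM default (0L) after deserialization.
trait ExplicitInit extends Serializable {
  @transient private var initialized = false
  protected def initInternal(): Unit
  final def initialize(): Unit = if (!initialized) { initInternal(); initialized = true }
}

class MonotonicallyIncreasingIDLike(start: Long) extends ExplicitInit {
  @transient private var count: Long = _  // becomes 0L after deserialization, regardless of `start`
  override protected def initInternal(): Unit = { count = start }  // the phase restores `start`
  def next(): Long = { count += 1; count }
}

object InitDemo extends App {
  val id = new MonotonicallyIncreasingIDLike(start = 100L)
  id.initialize()     // the proposed explicit phase; without it, count silently stays 0
  println(id.next())  // 101
}
{code}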
[jira] [Commented] (SPARK-9192) add initialization phase for nondeterministic expression
[ https://issues.apache.org/jira/browse/SPARK-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637124#comment-14637124 ]

Wenchen Fan commented on SPARK-9192: hi [~lian cheng], sorry about leaving the description empty at the beginning; it was the result of an offline discussion with rxin. I have updated it now. Does this make sense to you?

add initialization phase for nondeterministic expression

Key: SPARK-9192
URL: https://issues.apache.org/jira/browse/SPARK-9192
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Wenchen Fan

Currently nondeterministic expressions are broken without an explicit initialization phase. Take `MonotonicallyIncreasingID` as an example: this expression needs mutable state to remember how many times it has been evaluated, so we use `@transient var count: Long` there. Because it is transient, `count` is reset to 0, and **only** to 0, on serialization and deserialization, as deserializing a transient variable yields its default value. There is *no way* to use a different initial value for `count` until we add an explicit initialization phase. There is no nondeterministic expression that needs this feature today, but we may add new ones that need a different initial value for their mutable state in the future.
[jira] [Created] (SPARK-9256) Message delay causes Master crash upon registering application
Colin Scott created SPARK-9256:

Summary: Message delay causes Master crash upon registering application
Key: SPARK-9256
URL: https://issues.apache.org/jira/browse/SPARK-9256
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Colin Scott
Priority: Minor

This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines.

As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response. If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), it is possible for the Master to receive *two* RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes. Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario.

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), but upon inspecting the latest version of the code it appears that it is still possible to trigger it. Let me know if you would like reproducing steps for triggering it on the old version of Spark.
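For illustration, a toy Scala model of the deduplication the in-code TODO suggests; every name below is a hypothetical stand-in, not the Master's real implementation. The idea is to make registration idempotent so the persistence step runs at most once per application:

{code}
import scala.collection.mutable

// Toy model (hypothetical names, not Spark's Master): the persistence engine
// throws if asked to persist the same application twice, so the sketch
// deduplicates RegisterApplication by application identity and simply
// re-acknowledges duplicates instead of persisting again.
object MasterModel {
  private val persisted = mutable.Set[String]()               // "files" on disk
  private val registeredApps = mutable.Map[String, String]()  // appName -> appId

  private def persist(appId: String): Unit = {
    if (!persisted.add(appId))  // mimics FileSystemPersistenceEngine's failure mode
      throw new IllegalStateException(s"File app.$appId already exists")
  }

  def handleRegisterApplication(appName: String): String =
    registeredApps.getOrElseUpdate(appName, {
      val appId = s"app-$appName"
      persist(appId)  // reached at most once per application
      appId
    })
}

object RegistrationDemo extends App {
  // A delayed or duplicated RegisterApplication RPC no longer crashes the Master:
  assert(MasterModel.handleRegisterApplication("shell") ==
         MasterModel.handleRegisterApplication("shell"))
}
{code}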
[jira] [Commented] (SPARK-1301) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page
[ https://issues.apache.org/jira/browse/SPARK-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637299#comment-14637299 ]

Ryan Williams commented on SPARK-1301: [~srowen] I read this as referring to the Aggregated Metrics by Executor pane on the stage page, which is not in the Executors tab; it causes the more commonly accessed per-task table on the stage page to be many screen-heights below the fold when the number of executors is large. A similar argument could be made about the Distribution Across Executors table on the RDD page.

Add UI elements to collapse Aggregated Metrics by Executor pane on stage page

Key: SPARK-1301
URL: https://issues.apache.org/jira/browse/SPARK-1301
Project: Spark
Issue Type: Improvement
Components: Web UI
Reporter: Matei Zaharia
Priority: Minor
Labels: Starter

This table is useful but it takes up a lot of space on larger clusters, hiding the more commonly accessed stage page. We could also move the table below if collapsing it is difficult.
[jira] [Commented] (SPARK-4024) Remember user preferences for metrics to show in the UI
[ https://issues.apache.org/jira/browse/SPARK-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637306#comment-14637306 ]

Ryan Williams commented on SPARK-4024: FWIW it seemed like [~zsxwing] solved some of this in https://issues.apache.org/jira/browse/SPARK-4598 / https://github.com/apache/spark/pull/7399

Remember user preferences for metrics to show in the UI

Key: SPARK-4024
URL: https://issues.apache.org/jira/browse/SPARK-4024
Project: Spark
Issue Type: Improvement
Components: Web UI
Reporter: Kay Ousterhout
Priority: Minor

We should remember the metrics a user has previously chosen to display for each stage, so that the user doesn't need to reselect interesting metrics each time they open a stage detail page.
[jira] [Updated] (SPARK-9255) Timestamp handling incorrect for Spark 1.4.1 on Linux
[ https://issues.apache.org/jira/browse/SPARK-9255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Wu updated SPARK-9255:

Attachment: timestamp_bug.zip

The project can run without issues. But when it is deployed to the

Timestamp handling incorrect for Spark 1.4.1 on Linux

Key: SPARK-9255
URL: https://issues.apache.org/jira/browse/SPARK-9255
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1
Environment: Red Hat Linux, Java 8.0 and Spark 1.4.1 release.
Reporter: Paul Wu
Attachments: timestamp_bug.zip

This is a very strange case involving timestamps. I can run the program on Windows using the dev pom.xml (1.4.1) or the 1.3.1 release downloaded from Apache without issues, but when I run it on the Spark 1.4.1 release, either downloaded from Apache or built with Scala 2.11, on Red Hat Linux, it fails with the following error (the code I used is after this stack trace):

{code}
15/07/22 12:02:50 ERROR Executor 96: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: reflective compilation has failed: value is not a member of TimestampType.this.InternalType
at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102)
at org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170)
at org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:261)
at org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:246)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.tools.reflect.ToolBoxError: reflective compilation has failed: value is not a member of TimestampType.this.InternalType
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.throwIfErrors(ToolBoxFactory.scala:316)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:198)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:422)
at
{code}
[jira] [Updated] (SPARK-7075) Project Tungsten: Improving Physical Execution
[ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7075: Target Version/s: 1.6.0 (was: )

Project Tungsten: Improving Physical Execution

Key: SPARK-7075
URL: https://issues.apache.org/jira/browse/SPARK-7075
Project: Spark
Issue Type: Epic
Components: Block Manager, Shuffle, Spark Core, SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

Based on our observation, the majority of Spark workloads are not bottlenecked by I/O or network, but rather by CPU and memory. This project focuses on 3 areas to improve the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of the underlying hardware.

*Memory Management and Binary Processing*
- Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead.
- Minimizing memory usage through denser in-memory data formats, which means we spill less.
- Better memory accounting (size of bytes) rather than relying on heuristics.
- For operators that understand data types (in the case of DataFrames and SQL), work directly against binary format in memory, i.e. have no serialization/deserialization.

*Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle.

*Code Generation*
- Faster expression evaluation and DataFrame/SQL operators.
- Faster serializer.

Several parts of project Tungsten leverage the DataFrame model, which gives us more semantics about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible.
[jira] [Created] (SPARK-9255) Timestamp handling incorrect for Spark 1.4.1 on Linux
Paul Wu created SPARK-9255:

Summary: Timestamp handling incorrect for Spark 1.4.1 on Linux
Key: SPARK-9255
URL: https://issues.apache.org/jira/browse/SPARK-9255
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1
Environment: Red Hat Linux, Java 8.0 and Spark 1.4.1 release.
Reporter: Paul Wu

This is a very strange case involving timestamps. I can run the program on Windows using the dev pom.xml (1.4.1) or the 1.3.1 release downloaded from Apache without issues, but when I run it on the Spark 1.4.1 release, either downloaded from Apache or built with Scala 2.11, on Red Hat Linux, it fails with the following error (the code I used is after this stack trace):

{code}
15/07/22 12:02:50 ERROR Executor 96: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: reflective compilation has failed: value is not a member of TimestampType.this.InternalType
at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102)
at org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170)
at org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:261)
at org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:246)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.tools.reflect.ToolBoxError: reflective compilation has failed: value is not a member of TimestampType.this.InternalType
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.throwIfErrors(ToolBoxFactory.scala:316)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:198)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:422)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$withCompilerApi$.liftedTree2$1(ToolBoxFactory.scala:355)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$withCompilerApi$.apply(ToolBoxFactory.scala:355)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl.compile(ToolBoxFactory.scala:422)
at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl.eval(ToolBoxFactory.scala:444)
at
{code}
[jira] [Created] (SPARK-9257) Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput
Yin Huai created SPARK-9257:

Summary: Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput
Key: SPARK-9257
URL: https://issues.apache.org/jira/browse/SPARK-9257
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Yin Huai

{code}
sqlContext.sql(
  """
    |SELECT sum(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
Aggregate2Sort Some(List(key#510)), [key#510], [(sum(CAST(value#511, LongType))2,mode=Final,isDistinct=false)], [sum(CAST(value#511, LongType))#1435L], [sum(CAST(value#511, LongType))#1435L AS _c0#1426L]
 ExternalSort [key#510 ASC], false
  Exchange hashpartitioning(key#510)
   Aggregate2Sort None, [key#510], [(sum(CAST(value#511, LongType))2,mode=Partial,isDistinct=false)], [currentSum#1433L], [key#510,currentSum#1433L]
    ExternalSort [key#510 ASC], false
     PhysicalRDD [key#510,value#511], MapPartitionsRDD[97] at apply at Transformer.scala:22

sqlContext.sql(
  """
    |SELECT sum(distinct value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
!FinalAndCompleteAggregate2Sort [key#510,CAST(value#511, LongType)#1446L], [key#510], [(sum(CAST(value#511, LongType)#1446L)2,mode=Complete,isDistinct=false)], [sum(CAST(value#511, LongType))#1445L], [sum(CAST(value#511, LongType))#1445L AS _c0#1438L]
 Aggregate2Sort Some(List(key#510)), [key#510,CAST(value#511, LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
  ExternalSort [key#510 ASC,CAST(value#511, LongType)#1446L ASC], false
   Exchange hashpartitioning(key#510)
    !Aggregate2Sort None, [key#510,CAST(value#511, LongType) AS CAST(value#511, LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
     ExternalSort [key#510 ASC,CAST(value#511, LongType) AS CAST(value#511, LongType)#1446L ASC], false
      PhysicalRDD [key#510,value#511], MapPartitionsRDD[102] at apply at Transformer.scala:22
{code}

For the examples shown above, you can see there is a {{!}} at the beginning of some operators' {{simpleString}}, which indicates that their {{missingInput}} is not empty. Actually, it is a false negative and we need to fix it. Also, it would be good to make these two operators' {{simpleString}} more reader friendly (people should be able to tell what the grouping expressions are, what the aggregate functions are, and what the mode of an aggregate function is).
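For context, a toy Scala model (not Catalyst's actual types) of the check involved: {{missingInput}} is roughly the attributes an operator references minus what its children output, and an operator that generates attributes itself, such as the aggregation buffer {{currentSum#1433L}} above, must subtract those as well or they get flagged spuriously:

{code}
// Toy model of the missingInput check; field names are illustrative, not Catalyst's.
case class Op(references: Set[String], childOutput: Set[String], produced: Set[String]) {
  // Attributes referenced but neither coming from children nor generated here:
  def missingInput: Set[String] = references -- childOutput -- produced
}

object MissingInputDemo extends App {
  val partialAgg = Op(
    references  = Set("key", "currentSum"),
    childOutput = Set("key", "value"),
    produced    = Set("currentSum"))        // the aggregation buffer it creates
  assert(partialAgg.missingInput.isEmpty)   // without `produced`, currentSum is flagged ("!")
}
{code}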
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637337#comment-14637337 ]

Maruf Aytekin commented on SPARK-5992: In addition to Charikar's scheme for cosine [~karlhigley] pointed out, LSH schemes for the other known similarity/distance measures are as follows:

1. Hamming norm: A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB (1999). http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf
2. Lp norms: M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual Symposium on Computational Geometry (2004). http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf http://people.csail.mit.edu/indyk/nips-nn.ps
3. Jaccard distance: Mining Massive Data Sets, chapter 3: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
4. Cosine distance and Earth mover's distance (EMD): M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002). http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

Locality Sensitive Hashing (LSH) for MLlib

Key: SPARK-5992
URL: https://issues.apache.org/jira/browse/SPARK-5992
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm.
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637316#comment-14637316 ]

Joseph K. Bradley commented on SPARK-6885: We can resume this work. Do you think you'd have time to finish it by the end of this week? Sorry for the rush, but the code cutoff for the next release is in ~9 days. If you don't have time right now, I can send a patch instead. Thanks!

Decision trees: predict class probabilities

Key: SPARK-6885
URL: https://issues.apache.org/jira/browse/SPARK-6885
Project: Spark
Issue Type: Sub-task
Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

Under spark.ml, have DecisionTreeClassifier (currently being added) extend ProbabilisticClassifier.
[jira] [Resolved] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-9082.

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7446
[https://github.com/apache/spark/pull/7446]

Filter using non-deterministic expressions should not be pushed down

Key: SPARK-9082
URL: https://issues.apache.org/jira/browse/SPARK-9082
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Yin Huai
Assignee: Wenchen Fan
Fix For: 1.5.0

For example,

{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}

The plan is

{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}

The rand gets evaluated twice instead of once. This is because, when we push down predicates, we replace the attribute reference in the predicate with the actual expression.
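A simplified Catalyst-style sketch of the idea behind the fix, hedged rather than reproducing Spark's actual rule: alias substitution and pushdown happen only for deterministic predicates, so a predicate over {{rand(0)}} stays above the Project and the expression is evaluated once:

{code}
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Simplified sketch, not Spark's actual optimizer rule: substitute projected
// aliases into the predicate and push the Filter below the Project *only if*
// the predicate is deterministic.
object PushDeterministicPredicateThroughProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, project @ Project(projectList, child))
        if condition.deterministic =>
      val aliases: Map[Attribute, Expression] = projectList.collect {
        case a: Alias => (a.toAttribute: Attribute) -> a.child
      }.toMap
      // Rewrite references such as r#55325 back to their defining expressions.
      val substituted = condition.transform {
        case attr: Attribute => aliases.getOrElse(attr, attr)
      }
      project.copy(child = Filter(substituted, child))
  }
}
{code}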
[jira] [Resolved] (SPARK-9165) Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct
[ https://issues.apache.org/jira/browse/SPARK-9165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-9165.

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7537
[https://github.com/apache/spark/pull/7537]

Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct

Key: SPARK-9165
URL: https://issues.apache.org/jira/browse/SPARK-9165
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Assignee: Yijie Shen
Fix For: 1.5.0
[jira] [Updated] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Muller updated SPARK-6970:

Affects Version/s: (was: 1.4.1)
(was: 1.4.0)

Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

Key: SPARK-6970
URL: https://issues.apache.org/jira/browse/SPARK-6970
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.3.0
Reporter: John Muller
Priority: Trivial
Labels: DataFrame
Original Estimate: 2h
Remaining Estimate: 2h

The save options on DataFrames are not easily discerned: [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222] is where the pattern match occurs:

{code:title=ddl.scala|borderStyle=solid}
case dataSource: SchemaRelationProvider =>
  dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
{code}

Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet.
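Pending proper documentation, a hedged usage sketch (Spark 1.3/1.4-era API, later deprecated): the options map is passed verbatim through ResolvedDataSource to the data source, so which keys are meaningful, such as "path" below, is defined by the source rather than by Spark:

{code}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hedged example of the undocumented parameter: the recognized option keys
// (here "path") depend entirely on the resolved data source; the output path
// below is purely illustrative.
def writeJson(df: DataFrame): Unit =
  df.save("org.apache.spark.sql.json", SaveMode.Overwrite,
          Map("path" -> "/tmp/users.json"))
{code}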
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637443#comment-14637443 ]

Colin Scott commented on SPARK-6028: Curious: does this new RPC implementation use TCP as its underlying transport protocol, or UDP? (I believe akka-remote uses TCP by default.)

Provide an alternative RPC implementation based on the network transport module

Key: SPARK-6028
URL: https://issues.apache.org/jira/browse/SPARK-6028
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

The network transport module implements a low-level RPC interface. We can build a new RPC implementation on top of that to replace Akka's. Design document: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing
[jira] [Created] (SPARK-9259) How to write Python code to send data from Kafka via Spark to HDFS?
sutanu das created SPARK-9259:

Summary: How to write Python code to send data from Kafka via Spark to HDFS?
Key: SPARK-9259
URL: https://issues.apache.org/jira/browse/SPARK-9259
Project: Spark
Issue Type: Question
Components: PySpark, Spark Core
Reporter: sutanu das

1. How to write Python code to send data from Kafka via Spark to HDFS?
2. We want to send log lines from the Kafka queue to HDFS via Spark. Is there any base code available in Python for sending log files via Spark to HDFS?
3. Is there any such config available in Spark to write to HDFS, like hdfs.path = name_node:8020/path_2_hdfs (kind of like the storm.yaml file in Storm)?
[jira] [Created] (SPARK-9260) Standalone scheduling can overflow a worker with cores
Andrew Or created SPARK-9260:

Summary: Standalone scheduling can overflow a worker with cores
Key: SPARK-9260
URL: https://issues.apache.org/jira/browse/SPARK-9260
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Nishkam Ravi

If the cluster is started with `spark.deploy.spreadOut = false`, then we may allocate more cores than are available on a worker. E.g. if a worker has 8 cores and an application sets `spark.cores.max = 10`, then we end up with the following screenshot:
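A toy Scala sketch of the missing cap described above; the function and parameter names are illustrative, not the standalone Master's actual scheduling code:

{code}
object OverflowDemo extends App {
  // With spark.deploy.spreadOut = false the master packs an application onto
  // as few workers as possible; taking the app's remaining cores without
  // capping at the worker's free cores over-commits the worker.
  def coresToAssign(appCoresLeft: Int, workerCoresFree: Int): Int =
    math.min(appCoresLeft, workerCoresFree)  // the cap that prevents the overflow

  // 8-core worker, spark.cores.max = 10: assign 8 here, not 10.
  assert(coresToAssign(appCoresLeft = 10, workerCoresFree = 8) == 8)
}
{code}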
[jira] [Comment Edited] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637443#comment-14637443 ]

Colin Scott edited comment on SPARK-6028 at 7/22/15 7:10 PM: Curious: does this new RPC implementation use TCP as its underlying transport protocol, or UDP? In other words, does the underlying transport protocol guarantee in-order delivery between hosts? (I believe akka-remote uses TCP by default.) Thanks!

was (Author: colin_scott): Curious: does this new RPC implementation use TCP as its underlying transport protocol, or UDP? (I believe akka-remote uses TCP by default.)

Provide an alternative RPC implementation based on the network transport module

Key: SPARK-6028
URL: https://issues.apache.org/jira/browse/SPARK-6028
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

The network transport module implements a low-level RPC interface. We can build a new RPC implementation on top of that to replace Akka's. Design document: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637472#comment-14637472 ]

Reynold Xin commented on SPARK-6028: TCP.

Provide an alternative RPC implementation based on the network transport module

Key: SPARK-6028
URL: https://issues.apache.org/jira/browse/SPARK-6028
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

The network transport module implements a low-level RPC interface. We can build a new RPC implementation on top of that to replace Akka's. Design document: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing
[jira] [Updated] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9224: Shepherd: Joseph K. Bradley

OnlineLDAOptimizer Performance Improvements

Key: SPARK-9224
URL: https://issues.apache.org/jira/browse/SPARK-9224
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang

OnlineLDAOptimizer's current implementation can be improved by using in-place updating (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats.
[jira] [Updated] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9224: Assignee: Feynman Liang

OnlineLDAOptimizer Performance Improvements

Key: SPARK-9224
URL: https://issues.apache.org/jira/browse/SPARK-9224
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang

OnlineLDAOptimizer's current implementation can be improved by using in-place updating (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats.
[jira] [Updated] (SPARK-3947) Support Scala/Java UDAF
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-3947: Summary: Support Scala/Java UDAF (was: Support UDAF)

Support Scala/Java UDAF

Key: SPARK-3947
URL: https://issues.apache.org/jira/browse/SPARK-3947
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Pei-Lun Lee
Assignee: Yin Huai
Fix For: 1.5.0

Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs, through SQLContext.registerFunction.
[jira] [Created] (SPARK-9261) StreamingTab calls public APIs in Spark core that expose shaded classes
Marcelo Vanzin created SPARK-9261:

Summary: StreamingTab calls public APIs in Spark core that expose shaded classes
Key: SPARK-9261
URL: https://issues.apache.org/jira/browse/SPARK-9261
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Priority: Minor

There's a minor issue in {{StreamingTab}} that has hit me a couple of times when building with Maven. It calls methods in {{JettyUtils}} and {{WebUI}} that expose Jetty types (namely {{ServletContextHandler}}). Since Jetty is now shaded, it's not safe to do that: when running unit tests, the spark-core jar will have the shaded version of the APIs while the streaming classes haven't been shaded yet. This seems, at the lowest level, to be a bug in scalac (I've run into this issue in other modules before), since the code shouldn't compile at all, but we should avoid that kind of thing in the first place.
[jira] [Commented] (SPARK-9262) Treat Scala compiler warnings as errors
[ https://issues.apache.org/jira/browse/SPARK-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637581#comment-14637581 ]

Apache Spark commented on SPARK-9262: User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7598

Treat Scala compiler warnings as errors

Key: SPARK-9262
URL: https://issues.apache.org/jira/browse/SPARK-9262
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin

I've seen a few cases in the past few weeks where the compiler throws warnings that are caused by legitimate bugs. This patch updates warnings to errors, except deprecation warnings. Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).
[jira] [Assigned] (SPARK-9262) Treat Scala compiler warnings as errors
[ https://issues.apache.org/jira/browse/SPARK-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9262: Assignee: Apache Spark (was: Reynold Xin)

Treat Scala compiler warnings as errors

Key: SPARK-9262
URL: https://issues.apache.org/jira/browse/SPARK-9262
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Reynold Xin
Assignee: Apache Spark

I've seen a few cases in the past few weeks where the compiler throws warnings that are caused by legitimate bugs. This patch updates warnings to errors, except deprecation warnings. Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).
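In sbt terms the change amounts to something like the sketch below (assumed settings for illustration; the actual change is in the PR linked above). Leaving out -deprecation is what keeps deprecation warnings non-fatal, given that scalac cannot suppress individual warning messages:

{code}
// Sketch of "warnings as errors, except deprecation" for a build.sbt
// (assumed, not necessarily the PR's exact settings):
scalacOptions ++= Seq(
  "-unchecked",
  "-feature",
  "-Xfatal-warnings"  // any emitted warning now fails compilation
)
// Note: "-deprecation" is deliberately absent. With it, deprecated Hadoop API
// calls would emit individual warnings that -Xfatal-warnings turns into
// errors; without it, scalac only prints a non-fatal summary note.
{code}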
[jira] [Created] (SPARK-9258) Remove BroadcastLeftSemiJoinHash
Reynold Xin created SPARK-9258:

Summary: Remove BroadcastLeftSemiJoinHash
Key: SPARK-9258
URL: https://issues.apache.org/jira/browse/SPARK-9258
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Reynold Xin

We have more join operators than we have the resources to optimize. In this case, BroadcastLeftSemiJoinHash isn't very necessary. We can still use an equi-join operator to do the join, and just not include any values from the other side. We waste a little bit of space by building a hash map rather than a hash set, but at the end of the day, unless we are going to spend a lot of time optimizing the hash set, our Tungsten hash map will be a lot more efficient than the hash set anyway...
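To illustrate the point in plain Scala (illustrative collections, not Spark's operators): a left semi join only needs key membership on the build side, so a hash map built for equi-joins works where a hash set would, merely storing values it never reads:

{code}
object SemiJoinDemo extends App {
  // Stand-in for the broadcast build side: a Map where a Set would suffice.
  val buildSide: Map[Int, String] = Map(1 -> "a", 2 -> "b")
  val streamed = Seq((1, "x"), (3, "y"))

  // Left semi join = keep a streamed row iff its key exists on the build side;
  // the map's values are simply never consulted.
  val semiJoined = streamed.filter { case (k, _) => buildSide.contains(k) }
  assert(semiJoined == Seq((1, "x")))
}
{code}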
[jira] [Commented] (SPARK-9259) How to write Python code to send data from Kafka via Spark to HDFS?
[ https://issues.apache.org/jira/browse/SPARK-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637458#comment-14637458 ]

Marcelo Vanzin commented on SPARK-9259: Are you trying to report a bug or to ask questions? This is a bug tracker. For generic questions, please use the mailing lists: http://spark.apache.org/community.html

How to write Python code to send data from Kafka via Spark to HDFS?

Key: SPARK-9259
URL: https://issues.apache.org/jira/browse/SPARK-9259
Project: Spark
Issue Type: Question
Components: PySpark, Spark Core
Reporter: sutanu das

1. How to write Python code to send data from Kafka via Spark to HDFS?
2. We want to send log lines from the Kafka queue to HDFS via Spark. Is there any base code available in Python for sending log files via Spark to HDFS?
3. Is there any such config available in Spark to write to HDFS, like hdfs.path = name_node:8020/path_2_hdfs (kind of like the storm.yaml file in Storm)?
[jira] [Resolved] (SPARK-9259) How to write Python code to send data from Kafka via Spark to HDFS?
[ https://issues.apache.org/jira/browse/SPARK-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-9259. --- Resolution: Invalid How to write Python code to send data from Kafka via Spark to HDFS? --- Key: SPARK-9259 URL: https://issues.apache.org/jira/browse/SPARK-9259 Project: Spark Issue Type: Question Components: PySpark, Spark Core Reporter: sutanu das 1. How to write Python code to send data from Kafka via Spark to HDFS? 2. We want to send log lines from a Kafka queue to HDFS via Spark - is there any base code available in Python for sending log files via Spark to HDFS? 3. Is there any config available in Spark to write to HDFS? Something like hdfs.path = name_node:8020/path_2_hdfs (kinda like the storm.yaml file in Storm) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Muller updated SPARK-6970: --- Target Version/s: (was: 1.3.2) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable --- Key: SPARK-6970 URL: https://issues.apache.org/jira/browse/SPARK-6970 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0, 1.4.0, 1.4.1 Reporter: John Muller Priority: Trivial Labels: DataFrame Original Estimate: 2h Remaining Estimate: 2h The save options on DataFrames are not easily discerned: [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222] is where the pattern match occurs:
{code:title=ddl.scala|borderStyle=solid}
case dataSource: SchemaRelationProvider =>
  dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
{code}
Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
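For readers hitting this before docs exist, a hedged usage sketch, assuming the 1.3/1.4-era overload of save that accepts an options map is available (later deprecated in favor of df.write): the map is handed through to the data source's createRelation, keyed case-insensitively, and for the built-in sources "path" is the key that matters. The path below is a placeholder:
{code}
import org.apache.spark.sql.SaveMode

// `df` is an existing DataFrame.
df.save("json", SaveMode.ErrorIfExists, Map("path" -> "/tmp/out.json"))
{code}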
[jira] [Updated] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Muller updated SPARK-6970: --- Affects Version/s: 1.4.0 1.4.1 Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable --- Key: SPARK-6970 URL: https://issues.apache.org/jira/browse/SPARK-6970 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0, 1.4.0, 1.4.1 Reporter: John Muller Priority: Trivial Labels: DataFrame Original Estimate: 2h Remaining Estimate: 2h The save options on DataFrames are not easily discerned: [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222] is where the pattern match occurs:
{code:title=ddl.scala|borderStyle=solid}
case dataSource: SchemaRelationProvider =>
  dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
{code}
Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9258) Remove all semi join physical operator
[ https://issues.apache.org/jira/browse/SPARK-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9258: --- Description: We have 4 semi join operators. They are not really necessary. We can still use an equi-join operator to do the join, and just not include any columns from the other side. We waste a little bit of space by building a hash map rather than a hash set, but at the end of the day, unless we are going to spend a lot of time optimizing the hash set, our Tungsten hash map will be a lot more efficient than the hash set anyway. This way, semi-join automatically benefits from all the work we do in Tungsten. was: We have more join operators than we have resources to optimize them. In this case, BroadcastLeftSemiJoinHash isn't really necessary. We can still use an equi-join operator to do the join, and just not include any columns from the other side. We waste a little bit of space by building a hash map rather than a hash set, but at the end of the day, unless we are going to spend a lot of time optimizing the hash set, our Tungsten hash map will be a lot more efficient than the hash set anyway ... Remove all semi join physical operator -- Key: SPARK-9258 URL: https://issues.apache.org/jira/browse/SPARK-9258 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin We have 4 semi join operators. They are not really necessary. We can still use an equi-join operator to do the join, and just not include any columns from the other side. We waste a little bit of space by building a hash map rather than a hash set, but at the end of the day, unless we are going to spend a lot of time optimizing the hash set, our Tungsten hash map will be a lot more efficient than the hash set anyway. This way, semi-join automatically benefits from all the work we do in Tungsten. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9024) Unsafe HashJoin
[ https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9024. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7480 [https://github.com/apache/spark/pull/7480] Unsafe HashJoin --- Key: SPARK-9024 URL: https://issues.apache.org/jira/browse/SPARK-9024 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Fix For: 1.5.0 Create a version of BroadcastJoin that accepts UnsafeRow as input and produces UnsafeRow as output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9224. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7454 [https://github.com/apache/spark/pull/7454] OnlineLDAOptimizer Performance Improvements --- Key: SPARK-9224 URL: https://issues.apache.org/jira/browse/SPARK-9224 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Fix For: 1.5.0 OnlineLDAOptimizer's current implementation can be improved by using in-place updating (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
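A hedged illustration of the outer-product trick using Breeze (the sizes and values are made up): the per-(topic, term) loop collapses into a single rank-1 update:
{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val gammad = DenseVector(0.3, 0.7)        // per-topic weights (k = 2 topics)
val cts    = DenseVector(1.0, 4.0, 2.0)   // term counts (vocabulary of 3)

// Loop version: stats(i, j) += gammad(i) * cts(j)
val loopStats = DenseMatrix.zeros[Double](2, 3)
for (i <- 0 until 2; j <- 0 until 3) loopStats(i, j) += gammad(i) * cts(j)

// Outer-product version: one BLAS-friendly rank-1 update.
val outerStats: DenseMatrix[Double] = gammad * cts.t

assert(loopStats == outerStats)
{code}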
[jira] [Resolved] (SPARK-4234) Always do partial aggregation
[ https://issues.apache.org/jira/browse/SPARK-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-4234. - Resolution: Fixed Fix Version/s: 1.5.0 With the interface introduced by SPARK-4233, we have the capability to always do partial aggregations. I am resolving it. Always do partial aggregation --- Key: SPARK-4234 URL: https://issues.apache.org/jira/browse/SPARK-4234 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Fix For: 1.5.0 Currently, a UDAF developer only optionally implements a partial aggregation function; however, allowing that probably causes performance issues. We can actually always force developers to provide the partial aggregation function, as Hive does, so that we always get the map-side aggregation optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
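A hedged RDD-level analogy (not the UDAF interface itself) of why guaranteed partial aggregation pays off: with a merge function available, each partition pre-aggregates locally and only partial results cross the network. Here `sc` is assumed to be an existing SparkContext:
{code}
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Partial (map-side) aggregation: combiners run before the shuffle.
val sums = pairs.reduceByKey(_ + _)

// No partial aggregation: every raw record is shuffled to its group.
val sumsNoCombine = pairs.groupByKey().mapValues(_.sum)
{code}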
[jira] [Updated] (SPARK-9256) Message delay causes Master crash upon registering application
[ https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Scott updated SPARK-9256: --- Description: This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines. As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response. If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), it is possible for the Master to receive *two* RegisterApplication messages for the same AppClient. Upon receiving the second RegisterApplication message, the master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes. Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario. I have a reproducing scenario for this bug on an old version of Spark (1.0.1), but upon inspecting the latest version of the code it appears that it is still possible to trigger it. Let me know if you would like reproducing steps for triggering it on the old version of Spark. It should be possible to trigger this bug even if the underlying transport protocol is TCP, since TCP only guarantees in-order delivery in each direction of the connection but not in both directions. was: This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines. As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response. If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), it is possible for the Master to receive *two* RegisterApplication messages for the same AppClient. Upon receiving the second RegisterApplication message, the master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes. 
Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario. I have a reproducing scenario for this bug on an old version of Spark (1.0.1), but upon inspecting the latest version of the code it appears that it is still possible to trigger it. Let me know if you would like reproducing steps for triggering it on the old version of Spark. It should be possible to trigger this bug even if the underlying transport protocol is TCP, since TCP only guarantees in-order delivery in each direction of the connection, but not in both directions. Message delay causes Master crash upon registering application -- Key: SPARK-9256 URL: https://issues.apache.org/jira/browse/SPARK-9256 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Colin Scott Priority: Minor Original Estimate: 1h Remaining Estimate: 1h This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines. As part of initialization, the AppClient registers with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response.
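A hedged, self-contained sketch of the idempotence check the TODO hints at (the names are hypothetical, not Spark's actual Master internals): a duplicate RegisterApplication is acknowledged again rather than persisted again:
{code}
import scala.collection.mutable

case class RegisterApplication(clientId: String)

object MasterSketch {
  private val registered = mutable.Map.empty[String, String] // clientId -> appId
  private var nextId = 0

  def handleRegister(msg: RegisterApplication): String =
    registered.get(msg.clientId) match {
      case Some(appId) =>
        appId                          // duplicate RPC: re-ack, persist nothing
      case None =>
        nextId += 1
        val appId = s"app-$nextId"
        persist(appId)                 // first registration: safe to persist
        registered(msg.clientId) = appId
        appId
    }

  // Stands in for FileSystemPersistenceEngine, which throws if the file exists.
  private def persist(appId: String): Unit = println(s"persisting $appId")
}
{code}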
[jira] [Created] (SPARK-9262) Treat Scala compiler warnings as errors
Reynold Xin created SPARK-9262: -- Summary: Treat Scala compiler warnings as errors Key: SPARK-9262 URL: https://issues.apache.org/jira/browse/SPARK-9262 Project: Spark Issue Type: Improvement Components: Build Reporter: Reynold Xin Assignee: Reynold Xin I've seen a few cases in the past few weeks where the compiler threw warnings that were caused by legitimate bugs. This patch promotes warnings to errors, except deprecation warnings. Note that ideally we should be able to mark deprecation warnings as errors as well. However, since the Scala compiler provides no way to suppress individual warning messages, we cannot do that (we do need to access deprecated APIs in Hadoop). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9262) Treat Scala compiler warnings as errors
[ https://issues.apache.org/jira/browse/SPARK-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9262: --- Assignee: Reynold Xin (was: Apache Spark) Treat Scala compiler warnings as errors --- Key: SPARK-9262 URL: https://issues.apache.org/jira/browse/SPARK-9262 Project: Spark Issue Type: Improvement Components: Build Reporter: Reynold Xin Assignee: Reynold Xin I've seen a few cases in the past few weeks where the compiler threw warnings that were caused by legitimate bugs. This patch promotes warnings to errors, except deprecation warnings. Note that ideally we should be able to mark deprecation warnings as errors as well. However, since the Scala compiler provides no way to suppress individual warning messages, we cannot do that (we do need to access deprecated APIs in Hadoop). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9258) Remove all semi join physical operator
[ https://issues.apache.org/jira/browse/SPARK-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9258: --- Summary: Remove all semi join physical operator (was: Remove BroadcastLeftSemiJoinHash) Remove all semi join physical operator -- Key: SPARK-9258 URL: https://issues.apache.org/jira/browse/SPARK-9258 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin We have more join operators than we have resources to optimize them. In this case, BroadcastLeftSemiJoinHash isn't really necessary. We can still use an equi-join operator to do the join, and just not include any columns from the other side. We waste a little bit of space by building a hash map rather than a hash set, but at the end of the day, unless we are going to spend a lot of time optimizing the hash set, our Tungsten hash map will be a lot more efficient than the hash set anyway ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Muller closed SPARK-6970. -- Undocumented parts of DataFrames were deprecated Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable --- Key: SPARK-6970 URL: https://issues.apache.org/jira/browse/SPARK-6970 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: John Muller Priority: Trivial Labels: DataFrame Fix For: 1.4.0 Original Estimate: 2h Remaining Estimate: 2h The save options on DataFrames are not easily discerned: [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222] is where the pattern match occurs:
{code:title=ddl.scala|borderStyle=solid}
case dataSource: SchemaRelationProvider =>
  dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
{code}
Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Muller resolved SPARK-6970. Resolution: Won't Fix Fix Version/s: 1.4.0 Resolving as won't fix. The new DataFrames API deprecated save and saveAsTable. The new write() method also lacks docs; will open a new ticket when I have at least a partial patch for that. Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable --- Key: SPARK-6970 URL: https://issues.apache.org/jira/browse/SPARK-6970 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: John Muller Priority: Trivial Labels: DataFrame Fix For: 1.4.0 Original Estimate: 2h Remaining Estimate: 2h The save options on DataFrames are not easily discerned: [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222] is where the pattern match occurs:
{code:title=ddl.scala|borderStyle=solid}
case dataSource: SchemaRelationProvider =>
  dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
{code}
Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9256) Message delay causes Master crash upon registering application
[ https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Scott updated SPARK-9256: --- Description: This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines. As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response. If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), it is possible for the Master to receive *two* RegisterApplication messages for the same AppClient. Upon receiving the second RegisterApplication message, the master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes. Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario. I have a reproducing scenario for this bug on an old version of Spark (1.0.1), but upon inspecting the latest version of the code it appears that it is still possible to trigger it. Let me know if you would like reproducing steps for triggering it on the old version of Spark. It should be possible to trigger this bug even if the underlying transport protocol is TCP, since TCP only guarantees in-order delivery in each direction of the connection, but not in both directions. was: This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines. As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response. If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), it is possible for the Master to receive *two* RegisterApplication messages for the same AppClient. Upon receiving the second RegisterApplication message, the master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes. 
Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario. I have a reproducing scenario for this bug on an old version of Spark (1.0.1), but upon inspecting the latest version of the code it appears that it is still possible to trigger it. Let me know if you would like reproducing steps for triggering it on the old version of Spark. Message delay causes Master crash upon registering application -- Key: SPARK-9256 URL: https://issues.apache.org/jira/browse/SPARK-9256 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Colin Scott Priority: Minor Original Estimate: 1h Remaining Estimate: 1h This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I believe it is only possible to trigger in production when the AppClient and Master are on different machines. As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response.
[jira] [Commented] (SPARK-6802) User Defined Aggregate Function Refactoring
[ https://issues.apache.org/jira/browse/SPARK-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637555#comment-14637555 ] Yin Huai commented on SPARK-6802: - We have added Scala/Java UDAF support through SPARK-3947. Is this JIRA for Python UDAF? User Defined Aggregate Function Refactoring --- Key: SPARK-6802 URL: https://issues.apache.org/jira/browse/SPARK-6802 Project: Spark Issue Type: Improvement Components: PySpark, SQL Environment: We use Spark Dataframe, SQL along with json, sql and pandas quite a bit Reporter: cynepia While trying to use custom aggregates in Spark (something which is common in pandas), we realized that custom aggregate functions aren't well supported across various features/functions in Spark beyond what is supported by Hive. There are further discussions on the topic vis-à-vis SPARK-3947, which points to similar improvement tickets opened earlier for refactoring the UDAF area. While we refactor the interface for aggregates, it would make sense to take into consideration the recently added DataFrame, GroupedData, and possibly also sql.dataframe.Column, which looks different from pandas.Series and doesn't currently support any aggregations. We would like to get feedback from the folks who are actively looking at this... We would be happy to participate and contribute if there are any discussions on the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4233) Simplify the Aggregation Function implementation
[ https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4233. Resolution: Fixed Assignee: Cheng Hao Fix Version/s: 1.5.0 Simplify the Aggregation Function implementation Key: SPARK-4233 URL: https://issues.apache.org/jira/browse/SPARK-4233 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.5.0 Currently, the UDAF implementation is quite complicated, and we have to provide both distinct and non-distinct versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4367) Partial aggregation support the DISTINCT aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4367. Resolution: Fixed Assignee: Yin Huai Fix Version/s: 1.5.0 Partial aggregation support the DISTINCT aggregation Key: SPARK-4367 URL: https://issues.apache.org/jira/browse/SPARK-4367 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Yin Huai Fix For: 1.5.0 Most aggregate functions (e.g. average) with distinct values require all of the records in the same group to be shuffled to a single node. However, as part of the optimization, those records can be partially aggregated before shuffling, which probably reduces the overhead of shuffling significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
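A hedged RDD-level sketch of the idea for something like avg(DISTINCT value) per key (`sc` is assumed to be an existing SparkContext): deduplicate (key, value) pairs within each partition before the shuffle so fewer records cross the network; the global dedup and the final aggregate still run after the shuffle:
{code}
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 1.0), ("a", 2.0), ("b", 3.0)))

val partiallyDeduped = pairs.mapPartitions(_.toSet.iterator)  // map-side dedup

val avgDistinct = partiallyDeduped
  .distinct()                                                 // global dedup
  .groupByKey()
  .mapValues(vs => vs.sum / vs.size)
{code}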
[jira] [Updated] (SPARK-9231) DistributedLDAModel method for top topics per document
[ https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9231: - Description: Helper method in DistributedLDAModel of this form:
{code}
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}
I believe the above method signature will be Java-friendly. was: Helper method in DistributedLDAModel of this form:
{code}
/** For each document, return the top k weighted topics for that document. */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}
I believe the above method signature will be Java-friendly. DistributedLDAModel method for top topics per document -- Key: SPARK-9231 URL: https://issues.apache.org/jira/browse/SPARK-9231 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Original Estimate: 48h Remaining Estimate: 48h Helper method in DistributedLDAModel of this form:
{code}
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}
I believe the above method signature will be Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
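A hedged usage sketch, assuming the proposed method lands with the signature above (`model` is assumed to be an existing DistributedLDAModel):
{code}
val top3 = model.topTopicsPerDocument(3)
top3.take(5).foreach { case (docId, topicIdx, topicWeights) =>
  println(s"doc $docId: topics ${topicIdx.mkString(",")} " +
    s"weights ${topicWeights.mkString(",")}")
}
{code}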
[jira] [Commented] (SPARK-9223) Support model save/load in Python's LDA
[ https://issues.apache.org/jira/browse/SPARK-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636386#comment-14636386 ] Apache Spark commented on SPARK-9223: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/7587 Support model save/load in Python's LDA --- Key: SPARK-9223 URL: https://issues.apache.org/jira/browse/SPARK-9223 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8856) Better instrumentation and visualization for physical plan
[ https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636405#comment-14636405 ] Apache Spark commented on SPARK-8856: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7590 Better instrumentation and visualization for physical plan -- Key: SPARK-8856 URL: https://issues.apache.org/jira/browse/SPARK-8856 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu This is an umbrella ticket to improve physical plan instrumentation and visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8856) Better instrumentation and visualization for physical plan
[ https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8856: --- Assignee: Apache Spark (was: Shixiong Zhu) Better instrumentation and visualization for physical plan -- Key: SPARK-8856 URL: https://issues.apache.org/jira/browse/SPARK-8856 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Apache Spark This is an umbrella ticket to improve physical plan instrumentation and visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9213: -- Target Version/s: 1.6.0 (was: ) Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
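A hedged sketch of the conversion overhead in point 2 (plain java.util.regex; the strings are illustrative): matching over UTF-8 bytes today means decoding to a java.lang.String and re-encoding any result, both of which a byte-based engine such as joni could skip:
{code}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.regex.Pattern

val utf8Bytes: Array[Byte] = "spark-1.5.0".getBytes(UTF_8)

val decoded = new String(utf8Bytes, UTF_8)            // copy #1: bytes -> String
val m = Pattern.compile("spark-(\\d+)\\.\\d+\\.\\d+").matcher(decoded)
val major = if (m.matches()) m.group(1) else ""
val reencoded = major.getBytes(UTF_8)                 // copy #2: String -> bytes
{code}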
[jira] [Created] (SPARK-9250) ./dev/change-scala-version.sh should offer guidance what versions are accepted, i.e. 2.10 or 2.11
Jacek Laskowski created SPARK-9250: -- Summary: ./dev/change-scala-version.sh should offer guidance what versions are accepted, i.e. 2.10 or 2.11 Key: SPARK-9250 URL: https://issues.apache.org/jira/browse/SPARK-9250 Project: Spark Issue Type: Improvement Components: Build Environment: commit c03299a18b4e076cabb4b7833a1e7632c5c0dabe Reporter: Jacek Laskowski Priority: Minor With commit f5b6dc5 there's a new way of building Spark with Scala 2.10 or 2.11. The help message is not very helpful and could be improved to state the accepted versions and their format.
{code}
➜ spark git:(master) ./dev/change-scala-version.sh
Usage: change-scala-version.sh version
{code}
I can see the valid versions inside the script - they could be part of the help:
{code}
VALID_VERSIONS=( 2.10 2.11 )
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7075) Project Tungsten: Improving Physical Execution
[ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7075: --- Assignee: Reynold Xin (was: Apache Spark) Project Tungsten: Improving Physical Execution -- Key: SPARK-7075 URL: https://issues.apache.org/jira/browse/SPARK-7075 Project: Spark Issue Type: Epic Components: Block Manager, Shuffle, Spark Core, SQL Reporter: Reynold Xin Assignee: Reynold Xin Based on our observation, the majority of Spark workloads are not bottlenecked by I/O or network, but rather by CPU and memory. This project focuses on 3 areas to improve the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of the underlying hardware.
*Memory Management and Binary Processing*
- Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead.
- Minimizing memory usage through a denser in-memory data format, which means we spill less.
- Better memory accounting (size of bytes) rather than relying on heuristics.
- For operators that understand data types (in the case of DataFrames and SQL), work directly against binary format in memory, i.e. have no serialization/deserialization.
*Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle.
*Code Generation*
- Faster expression evaluation and DataFrame/SQL operators.
- Faster serializer.
Several parts of project Tungsten leverage the DataFrame model, which gives us more semantics about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7075) Project Tungsten: Improving Physical Execution
[ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636496#comment-14636496 ] Apache Spark commented on SPARK-7075: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7592 Project Tungsten: Improving Physical Execution -- Key: SPARK-7075 URL: https://issues.apache.org/jira/browse/SPARK-7075 Project: Spark Issue Type: Epic Components: Block Manager, Shuffle, Spark Core, SQL Reporter: Reynold Xin Assignee: Reynold Xin Based on our observation, the majority of Spark workloads are not bottlenecked by I/O or network, but rather by CPU and memory. This project focuses on 3 areas to improve the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of the underlying hardware.
*Memory Management and Binary Processing*
- Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead.
- Minimizing memory usage through a denser in-memory data format, which means we spill less.
- Better memory accounting (size of bytes) rather than relying on heuristics.
- For operators that understand data types (in the case of DataFrames and SQL), work directly against binary format in memory, i.e. have no serialization/deserialization.
*Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle.
*Code Generation*
- Faster expression evaluation and DataFrame/SQL operators.
- Faster serializer.
Several parts of project Tungsten leverage the DataFrame model, which gives us more semantics about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9131) UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636508#comment-14636508 ] Luis Guerra commented on SPARK-9131: By the way, the UDF code should be changed to return StringType(). I changed the data type to string, since the data type does not matter when reproducing the issue with the UDFs. UDFs change data values --- Key: SPARK-9131 URL: https://issues.apache.org/jira/browse/SPARK-9131 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0, 1.4.1 Environment: Pyspark 1.4 and 1.4.1 Reporter: Luis Guerra Priority: Critical Attachments: testjson_jira9131.z01, testjson_jira9131.z02, testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, testjson_jira9131.z06, testjson_jira9131.zip I am having some trouble when using a custom UDF in DataFrames with PySpark 1.4. I have rewritten the UDF to simplify the problem, and it gets even weirder. The UDFs I am using do absolutely nothing; they just receive some value and output the same value with the same format. My code is shown below:
{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(vinc_muestra['t2']).alias('tb'),
             udf_B(vinc_muestra['t1']).alias('tc'),
             udf_C(vinc_muestra['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}
I am showing here the results from the outputs:
{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}
The problem here is that the values in columns 'tb', 'tc' and 'td' in dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other records (or just invented). Results differ between executions (apparently at random). Thanks in advance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9248) Closing curly-braces should always be on their own line
Yu Ishikawa created SPARK-9248: -- Summary: Closing curly-braces should always be on their own line Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Closing curly-braces should always be on their own line -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9251) do not order by expressions which still need evaluation
[ https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9251: --- Assignee: (was: Apache Spark) do not order by expressions which still need evaluation --- Key: SPARK-9251 URL: https://issues.apache.org/jira/browse/SPARK-9251 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9083) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions
[ https://issues.apache.org/jira/browse/SPARK-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636546#comment-14636546 ] Apache Spark commented on SPARK-9083: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7593 If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions -- Key: SPARK-9083 URL: https://issues.apache.org/jira/browse/SPARK-9083 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Wenchen Fan When an order by clause has a non-deterministic expression, we actually evaluate it twice: once in the exchange operator, when we try to figure out the range partitioner's boundaries, and once in the sort operator. We should use a project to materialize the result first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
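A hedged DataFrame illustration of both the problem and the proposed fix (`df` is assumed to be an existing DataFrame):
{code}
import org.apache.spark.sql.functions.rand

// Problem: the non-deterministic expression is evaluated once when the
// exchange samples range-partition boundaries and again inside the sort.
val doubleEvaluated = df.orderBy(rand())

// Fix, made explicit: project the expression into a materialized column
// first, then sort by that column.
val materialized = df.withColumn("_rnd", rand()).orderBy("_rnd").drop("_rnd")
{code}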
[jira] [Commented] (SPARK-9251) do not order by expressions which still need evaluation
[ https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636545#comment-14636545 ] Apache Spark commented on SPARK-9251: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7593 do not order by expressions which still need evaluation --- Key: SPARK-9251 URL: https://issues.apache.org/jira/browse/SPARK-9251 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9083) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions
[ https://issues.apache.org/jira/browse/SPARK-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9083: --- Assignee: Wenchen Fan (was: Apache Spark) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions -- Key: SPARK-9083 URL: https://issues.apache.org/jira/browse/SPARK-9083 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Wenchen Fan When an order by clause has a non-deterministic expression, we actually evaluate it twice: once in the exchange operator, when we try to figure out the range partitioner's boundaries, and once in the sort operator. We should use a project to materialize the result first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9083) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions
[ https://issues.apache.org/jira/browse/SPARK-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9083: --- Assignee: Apache Spark (was: Wenchen Fan) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions -- Key: SPARK-9083 URL: https://issues.apache.org/jira/browse/SPARK-9083 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Apache Spark When an order by clause has a non-deterministic expression, we actually evaluate it twice: once in the exchange operator, when we try to figure out the range partitioner's boundaries, and once in the sort operator. We should use a project to materialize the result first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9251) do not order by expressions which still need evaluation
[ https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9251: --- Assignee: Apache Spark do not order by expressions which still need evaluation --- Key: SPARK-9251 URL: https://issues.apache.org/jira/browse/SPARK-9251 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9223) Support model save/load in Python's LDA
[ https://issues.apache.org/jira/browse/SPARK-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9223: --- Assignee: Apache Spark Support model save/load in Python's LDA --- Key: SPARK-9223 URL: https://issues.apache.org/jira/browse/SPARK-9223 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9223) Support model save/load in Python's LDA
[ https://issues.apache.org/jira/browse/SPARK-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9223: --- Assignee: (was: Apache Spark) Support model save/load in Python's LDA --- Key: SPARK-9223 URL: https://issues.apache.org/jira/browse/SPARK-9223 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636407#comment-14636407 ] Apache Spark commented on SPARK-8187: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/7589 date/time function: date_sub Key: SPARK-8187 URL: https://issues.apache.org/jira/browse/SPARK-8187 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang date_sub(timestamp startdate, int days): timestamp date_sub(timestamp startdate, interval i): timestamp date_sub(date date, int days): date date_sub(date date, interval i): date -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8856) Better instrumentation and visualization for physical plan
[ https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8856: --- Assignee: Shixiong Zhu (was: Apache Spark) Better instrumentation and visualization for physical plan -- Key: SPARK-8856 URL: https://issues.apache.org/jira/browse/SPARK-8856 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu This is an umbrella ticket to improve physical plan instrumentation and visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636406#comment-14636406 ] Apache Spark commented on SPARK-8186: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/7589 date/time function: date_add Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang date_add(timestamp startdate, int days): timestamp date_add(timestamp startdate, interval i): timestamp date_add(date date, int days): date date_add(date date, interval i): date -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
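A hedged usage sketch of the date_add/date_sub signatures listed in these two tickets, via Spark SQL, assuming the functions land as specified (`sqlContext` is assumed to be an existing SQLContext; the literal date is illustrative):
{code}
sqlContext.sql(
  "SELECT date_add(CAST('2015-07-22' AS DATE), 7), " +
  "       date_sub(CAST('2015-07-22' AS DATE), 7)"
).show()
{code}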
[jira] [Commented] (SPARK-8856) Better instrumentation and visualization for physical plan
[ https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636413#comment-14636413 ] Reynold Xin commented on SPARK-8856: It's fine to mark it as in progress since it is actually in progress. Better instrumentation and visualization for physical plan -- Key: SPARK-8856 URL: https://issues.apache.org/jira/browse/SPARK-8856 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu This is an umbrella ticket to improve physical plan instrumentation and visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8856) Better instrumentation and visualization for physical plan
[ https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636412#comment-14636412 ] Feynman Liang commented on SPARK-8856: -- Oops, I tagged the wrong JIRA in the PR. Can you please mark as Open again? Better instrumentation and visualization for physical plan -- Key: SPARK-8856 URL: https://issues.apache.org/jira/browse/SPARK-8856 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu This is an umbrella ticket to improve physical plan instrumentation and visualization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9131) UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636490#comment-14636490 ] Luis Guerra commented on SPARK-9131: Actually, compression reduces the file to 27 MB, closer to the limit but still over it. UDFs change data values --- Key: SPARK-9131 URL: https://issues.apache.org/jira/browse/SPARK-9131 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0, 1.4.1 Environment: Pyspark 1.4 and 1.4.1 Reporter: Luis Guerra Priority: Critical I am having some trouble when using a custom UDF in DataFrames with PySpark 1.4. I have rewritten the UDF to simplify the problem, and it gets even weirder. The UDFs I am using do absolutely nothing; they just receive some value and output the same value with the same format. My code is shown below:
{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(vinc_muestra['t2']).alias('tb'),
             udf_B(vinc_muestra['t1']).alias('tc'),
             udf_C(vinc_muestra['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}
I am showing here the results from the outputs:
{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}
The problem here is that the values in columns 'tb', 'tc' and 'td' in dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other records (or just invented). Results differ between executions (apparently at random). Thanks in advance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9131) UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636489#comment-14636489 ] Luis Guerra commented on SPARK-9131: Agree, I have prepared a dataset.json and it is ready to be uploaded. However, its size is too large (more than 600 MB). How can I upload it for you? UDFs change data values --- Key: SPARK-9131 URL: https://issues.apache.org/jira/browse/SPARK-9131 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0, 1.4.1 Environment: Pyspark 1.4 and 1.4.1 Reporter: Luis Guerra Priority: Critical I am having some trouble when using a custom UDF in DataFrames with PySpark 1.4. I have rewritten the UDF to simplify the problem, and it gets even weirder. The UDFs I am using do absolutely nothing; they just receive some value and output the same value with the same format. My code is shown below:
{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(vinc_muestra['t2']).alias('tb'),
             udf_B(vinc_muestra['t1']).alias('tc'),
             udf_C(vinc_muestra['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}
I am showing here the results from the outputs:
{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}
The problem here is that the values in columns 'tb', 'tc' and 'td' in dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other records (or just invented). Results differ between executions (apparently at random). Thanks in advance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9131) UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Guerra updated SPARK-9131: --- Attachment: testjson_jira9131.z02 testjson_jira9131.z03 testjson_jira9131.z05 testjson_jira9131.z04 testjson_jira9131.zip testjson_jira9131.z06 testjson_jira9131.z01 I hope they work fine. I have split the data into several files to stay under the size limit. UDFs change data values --- Key: SPARK-9131 URL: https://issues.apache.org/jira/browse/SPARK-9131 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0, 1.4.1 Environment: PySpark 1.4 and 1.4.1 Reporter: Luis Guerra Priority: Critical Attachments: testjson_jira9131.z01, testjson_jira9131.z02, testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, testjson_jira9131.z06, testjson_jira9131.zip I am having some trouble when using a custom UDF on DataFrames with PySpark 1.4. I have rewritten the UDF to simplify the problem, and it gets even weirder. The UDFs I am using do absolutely nothing: they just receive a value and output the same value in the same format. My code is shown below:
{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(vinc_muestra['t2']).alias('tb'),
             udf_B(vinc_muestra['t1']).alias('tc'),
             udf_C(vinc_muestra['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}
These are the results of the two show() calls:
{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}
The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other records (or simply invented). The results also differ between executions (apparently at random). Thanks in advance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9245) DistributedLDAModel predict top topic per doc-term instance
Joseph K. Bradley created SPARK-9245: Summary: DistributedLDAModel predict top topic per doc-term instance Key: SPARK-9245 URL: https://issues.apache.org/jira/browse/SPARK-9245 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. tokens) are exchangeable, so we should provide an estimate per document-term, rather than per token. Synopsis for DistributedLDAModel: {code} /** @return RDD of (doc ID, vector of top topic index for each term) */ def topTopicAssignments: RDD[(Long, Vector)] {code} Note that using Vector will let us have a sparse encoding which is Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
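[Editor's note] A hedged usage sketch of the proposed method, for illustration only: `model` is an assumed trained DistributedLDAModel, and the method does not exist yet; only the synopsis above is specified.
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical consumption of the proposed API, assuming `model` is a
// trained DistributedLDAModel exposing topTopicAssignments.
val assignments: RDD[(Long, Vector)] = model.topTopicAssignments
assignments.take(3).foreach { case (docId, topics) =>
  // topics(termIndex) holds the most likely topic index for that term;
  // the Vector return type allows a sparse, Java-friendly encoding.
  println(s"doc $docId: term 0 -> topic ${topics(0).toInt}")
}
{code}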
[jira] [Commented] (SPARK-8861) Add basic instrumentation to each SparkPlan operator
[ https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636411#comment-14636411 ] Apache Spark commented on SPARK-8861: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7590 Add basic instrumentation to each SparkPlan operator Key: SPARK-8861 URL: https://issues.apache.org/jira/browse/SPARK-8861 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The basic metric can be the number of tuples flowing through; we can add more metrics later. In order for this to work, we can add a new accumulators method to SparkPlan that defines the list of accumulators, e.g.
{code}
def accumulators: Map[String, Accumulator]
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
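[Editor's note] A minimal sketch of the idea, not the actual SparkPlan code: an operator registers one named accumulator per metric and bumps it as rows flow through. The class and metric names here are illustrative, and a SparkContext `sc` is assumed.
{code}
import org.apache.spark.{Accumulator, SparkContext}

// Sketch only: one named accumulator per metric, kept in a map so the
// web UI or tests can look metrics up by name.
class CountingOperator(sc: SparkContext) {
  val numOutputRows: Accumulator[Long] = sc.accumulator(0L, "numOutputRows")

  def accumulators: Map[String, Accumulator[Long]] =
    Map("numOutputRows" -> numOutputRows)

  // Called once per produced row by the (omitted) execution path.
  def recordRow(): Unit = numOutputRows += 1L
}
{code}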
[jira] [Assigned] (SPARK-8861) Add basic instrumentation to each SparkPlan operator
[ https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8861: --- Assignee: (was: Apache Spark) Add basic instrumentation to each SparkPlan operator Key: SPARK-8861 URL: https://issues.apache.org/jira/browse/SPARK-8861 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The basic metric can be the number of tuples flowing through; we can add more metrics later. In order for this to work, we can add a new accumulators method to SparkPlan that defines the list of accumulators, e.g.
{code}
def accumulators: Map[String, Accumulator]
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8861) Add basic instrumentation to each SparkPlan operator
[ https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8861: --- Assignee: Apache Spark Add basic instrumentation to each SparkPlan operator Key: SPARK-8861 URL: https://issues.apache.org/jira/browse/SPARK-8861 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark The basic metric can be the number of tuples flowing through; we can add more metrics later. In order for this to work, we can add a new accumulators method to SparkPlan that defines the list of accumulators, e.g.
{code}
def accumulators: Map[String, Accumulator]
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9246) DistributedLDAModel predict top docs per topic
Joseph K. Bradley created SPARK-9246: Summary: DistributedLDAModel predict top docs per topic Key: SPARK-9246 URL: https://issues.apache.org/jira/browse/SPARK-9246 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley For each topic, return top documents based on topicDistributions. Synopsis: {code} /** * @param maxDocuments Max docs to return for each topic * @return Array over topics of (sorted top docs, corresponding doc-topic weights) */ def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])] {code} Note: We will need to make sure that the above return value format is Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
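[Editor's note] Again purely a hedged usage sketch against the synopsis above; `model` is an assumed trained DistributedLDAModel, and the method does not exist yet.
{code}
// Hypothetical call: print the single strongest document for each topic.
// Each array element pairs sorted doc IDs with their doc-topic weights.
val topDocs: Array[(Array[Long], Array[Double])] = model.topDocumentsPerTopic(5)
topDocs.zipWithIndex.foreach { case ((docIds, weights), topic) =>
  println(s"topic $topic: top doc ${docIds.head} (weight ${weights.head})")
}
{code}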
[jira] [Deleted] (SPARK-7590) Test Issue to Debug JIRA Problem
[ https://issues.apache.org/jira/browse/SPARK-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-7590: -- Test Issue to Debug JIRA Problem Key: SPARK-7590 URL: https://issues.apache.org/jira/browse/SPARK-7590 Project: Spark Issue Type: Bug Reporter: Patrick Wendell -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636495#comment-14636495 ] Apache Spark commented on SPARK-8579: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7591 Support arbitrary object in UnsafeRow - Key: SPARK-8579 URL: https://issues.apache.org/jira/browse/SPARK-8579 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.5.0 It's common to run count(distinct xxx) in SQL; the data type will be a UDT of OpenHashSet, so it would be good if we could use UnsafeRow to reduce memory usage during aggregation. The same applies to DecimalType, which can be used inside the grouping key. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7075) Project Tungsten: Improving Physical Execution
[ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7075: --- Assignee: Apache Spark (was: Reynold Xin) Project Tungsten: Improving Physical Execution -- Key: SPARK-7075 URL: https://issues.apache.org/jira/browse/SPARK-7075 Project: Spark Issue Type: Epic Components: Block Manager, Shuffle, Spark Core, SQL Reporter: Reynold Xin Assignee: Apache Spark Based on our observations, the majority of Spark workloads are bottlenecked not by I/O or network, but by CPU and memory. This project focuses on 3 areas to improve the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of the underlying hardware.
*Memory Management and Binary Processing*
- Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead
- Minimizing memory usage through a denser in-memory data format, which means we spill less
- Better memory accounting (size of bytes) rather than relying on heuristics
- For operators that understand data types (in the case of DataFrames and SQL), work directly against the binary format in memory, i.e. have no serialization/deserialization
*Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle
*Code Generation*
- Faster expression evaluation and DataFrame/SQL operators
- Faster serializer
Several parts of project Tungsten leverage the DataFrame model, which gives us more semantics about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9247) Use BytesToBytesMap in unsafe broadcast join
Davies Liu created SPARK-9247: - Summary: Use BytesToBytesMap in unsafe broadcast join Key: SPARK-9247 URL: https://issues.apache.org/jira/browse/SPARK-9247 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical For better performance (both CPU and memory) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9249) local variable assigned but may not be used
Yu Ishikawa created SPARK-9249: -- Summary: local variable assigned but may not be used Key: SPARK-9249 URL: https://issues.apache.org/jira/browse/SPARK-9249 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor local variable assigned but may not be used -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9244) Increase some default memory limits
[ https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9244: --- Assignee: Apache Spark (was: Matei Zaharia) Increase some default memory limits --- Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Apache Spark There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown.
- spark.akka.frameSize: This defaults to 10 but is often hit for map output statuses in large shuffles. AFAIK the memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never send a status that large.
- spark.executor.memory: Defaults to 512m, which is really small. We can at least increase it to 1g, though this is something users do need to set on their own.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
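[Editor's note] Until any defaults change, both limits can already be raised by hand. A sketch with illustrative values, not recommendations taken from this issue:
{code}
import org.apache.spark.SparkConf

// Illustrative values only: raise the Akka frame size (in MB) and the
// executor memory above their defaults of 10 and 512m respectively.
val conf = new SparkConf()
  .set("spark.akka.frameSize", "64")
  .set("spark.executor.memory", "1g")
{code}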
[jira] [Assigned] (SPARK-9244) Increase some default memory limits
[ https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9244: --- Assignee: Matei Zaharia (was: Apache Spark) Increase some default memory limits --- Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown.
- spark.akka.frameSize: This defaults to 10 but is often hit for map output statuses in large shuffles. AFAIK the memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never send a status that large.
- spark.executor.memory: Defaults to 512m, which is really small. We can at least increase it to 1g, though this is something users do need to set on their own.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9232) Duplicate code in JSONRelation
[ https://issues.apache.org/jira/browse/SPARK-9232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9232. Resolution: Fixed Fix Version/s: 1.5.0 Duplicate code in JSONRelation -- Key: SPARK-9232 URL: https://issues.apache.org/jira/browse/SPARK-9232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Fix For: 1.5.0 The following block appears identically in two places:
{code}
var success: Boolean = false
try {
  success = fs.delete(filesystemPath, true)
} catch {
  case e: IOException =>
    throw new IOException(
      s"Unable to clear output directory ${filesystemPath.toString} prior " +
        s"to writing to JSON table:\n${e.toString}")
}
if (!success) {
  throw new IOException(
    s"Unable to clear output directory ${filesystemPath.toString} prior " +
      s"to writing to JSON table.")
}
{code}
https://github.com/apache/spark/blob/e5d2c37c68ac00a57c2542e62d1c5b4ca267c89e/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L72 https://github.com/apache/spark/blob/e5d2c37c68ac00a57c2542e62d1c5b4ca267c89e/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L131 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
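[Editor's note] One obvious shape for the fix, sketched here with an illustrative method name (the actual pull request may differ), is to hoist the duplicated block into a single helper that both call sites share:
{code}
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Deduplication sketch: both call sites delegate to one helper.
// The helper name is illustrative, not taken from the actual fix.
private def clearOutputDirectory(fs: FileSystem, filesystemPath: Path): Unit = {
  val success =
    try fs.delete(filesystemPath, true)
    catch {
      case e: IOException =>
        throw new IOException(
          s"Unable to clear output directory ${filesystemPath.toString} prior " +
            s"to writing to JSON table:\n${e.toString}")
    }
  if (!success) {
    throw new IOException(
      s"Unable to clear output directory ${filesystemPath.toString} prior " +
        "to writing to JSON table.")
  }
}
{code}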
[jira] [Commented] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
[ https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636367#comment-14636367 ] Apache Spark commented on SPARK-9144: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7585 Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled --- Key: SPARK-9144 URL: https://issues.apache.org/jira/browse/SPARK-9144 Project: Spark Issue Type: Improvement Components: Scheduler, Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark has an option called {{spark.localExecution.enabled}}; according to the docs: {quote} Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver. {quote} This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the {{runLocallyWithinThread}} method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
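[Editor's note] For reference, this is the flag that would disappear; the snippet below only illustrates what is being removed.
{code}
import org.apache.spark.SparkConf

// The configuration slated for removal. After the proposed change,
// first() and take() always run as ordinary cluster jobs.
val conf = new SparkConf().set("spark.localExecution.enabled", "true")
{code}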
[jira] [Commented] (SPARK-9244) Increase some default memory limits
[ https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636371#comment-14636371 ] Apache Spark commented on SPARK-9244: - User 'mateiz' has created a pull request for this issue: https://github.com/apache/spark/pull/7586 Increase some default memory limits --- Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown.
- spark.akka.frameSize: This defaults to 10 but is often hit for map output statuses in large shuffles. AFAIK the memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never send a status that large.
- spark.executor.memory: Defaults to 512m, which is really small. We can at least increase it to 1g, though this is something users do need to set on their own.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org