[jira] [Updated] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-9550:
------------------------------
Description:

This ticket tracks configurations which need to be renamed, deprecated, or have their defaults changed for Spark 1.5.0. Note that subtasks / comments here do not necessarily reflect changes that must be performed; rather, tasks should be added here to make sure that the relevant configurations are at least checked before we cut releases. This ticket will also help us to track configuration changes which must make it into the release notes.

*Configuration renaming*
- Consider renaming {{spark.shuffle.memoryFraction}} to {{spark.execution.memoryFraction}} ([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]).
- Rename all public-facing uses of {{unsafe}} to something less scary, such as {{tungsten}}.

*Defaults changes*
- Codegen is now enabled by default.
- Tungsten is now enabled by default.
- Parquet schema merging is now disabled by default.
- In-memory relation partition pruning should be enabled by default (SPARK-9554).

*Deprecation*
- Local execution has been removed.

was: the same description, with the SPARK-9554 reference on the Parquet schema merging bullet instead of the in-memory partition pruning bullet.

Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
-----------------------------------------------------------------------------------
Key: SPARK-9550
URL: https://issues.apache.org/jira/browse/SPARK-9550
Project: Spark
Issue Type: Task
Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker
[jira] [Updated] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-9550:
------------------------------
Description: the same description as in the update above, with the SPARK-9554 reference on the Parquet schema merging bullet rather than the in-memory partition pruning bullet.

was: the same description with no SPARK-9554 reference.

Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
-----------------------------------------------------------------------------------
Key: SPARK-9550
URL: https://issues.apache.org/jira/browse/SPARK-9550
Project: Spark
Issue Type: Task
Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker
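For the rename proposed above, application code can bridge a deprecation window by reading either key. A minimal, hedged PySpark sketch (the helper {{get_with_fallback}} is illustrative, not a Spark API; the new key is the one the ticket proposes):

{code}
from pyspark import SparkConf

def get_with_fallback(conf, new_key, old_key, default):
    """Prefer new_key; fall back to the deprecated old_key with a warning."""
    if conf.contains(new_key):
        return conf.get(new_key)
    if conf.contains(old_key):
        print("WARN: '%s' is deprecated; use '%s' instead." % (old_key, new_key))
        return conf.get(old_key)
    return default

conf = SparkConf()
fraction = get_with_fallback(conf,
                             "spark.execution.memoryFraction",  # proposed name
                             "spark.shuffle.memoryFraction",    # legacy name
                             "0.2")
{code}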
[jira] [Commented] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652001#comment-14652001 ]

Josh Rosen commented on SPARK-9550:
-----------------------------------

Memory defaults changed (will find JIRA links later): https://github.com/apache/spark/pull/7896

Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
-----------------------------------------------------------------------------------
Key: SPARK-9550
URL: https://issues.apache.org/jira/browse/SPARK-9550
Project: Spark
Issue Type: Task
Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651984#comment-14651984 ]

partha bishnu commented on SPARK-9559:
--------------------------------------

Thanks. If I understand correctly, {{--num-executors}} is for deploying on a YARN cluster and {{--total-executor-cores}} is for a Spark standalone cluster. I am using a Spark standalone cluster.

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
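For reference, the two spark-submit flags discussed above correspond to standard configuration keys that can also be set programmatically. A hedged PySpark sketch (the keys are the documented ones; the app name and values are illustrative):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("standalone-sizing-example")
        # Standalone mode: cap the total cores across all executors
        # (what --total-executor-cores controls).
        .set("spark.cores.max", "4")
        # YARN mode equivalent of --num-executors:
        .set("spark.executor.instances", "2"))

sc = SparkContext(conf=conf)
{code}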
[jira] [Comment Edited] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651907#comment-14651907 ]

partha bishnu edited comment on SPARK-9559 at 8/3/15 2:24 PM:
--------------------------------------------------------------

The expected behavior should be that the spark master on n-1 should restart the jobs with one new executor under the running worker JVM on the other worker node n-3, which is up and running after n-2 went down. Isn't that the expected behavior? But that does not happen. Thanks for your comments.

was (Author: pa1975): the same comment with the node names swapped (worker node n-2 up and running after n-3 went down).

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9484) Word2Vec import/export for original binary format
[ https://issues.apache.org/jira/browse/SPARK-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651899#comment-14651899 ]

Manoj Kumar commented on SPARK-9484:
------------------------------------

I just went through the C code that does the .bin reading. What would be the best way to go about this? The code paths should be almost completely different depending on whether {{path.endsWith(".bin")}} or not, right? Also, should this use the {{SaveLoadV1_0}} object, or should we have a different object (say {{SaveLoadBinary}}) which would keep the code paths independent and make maintenance easier?

Word2Vec import/export for original binary format
-------------------------------------------------
Key: SPARK-9484
URL: https://issues.apache.org/jira/browse/SPARK-9484
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

It would be nice to add model import/export for Word2Vec which handles the original binary format used by [https://code.google.com/p/word2vec/].
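For context, the original word2vec .bin layout is a text header ("<vocabSize> <vectorSize>") followed, per entry, by a space-terminated word and then vectorSize little-endian 4-byte floats. A minimal, hedged Python sketch of a reader (not Spark's SaveLoadV1_0 code; {{load_word2vec_bin}} is an illustrative name):

{code}
import struct

def load_word2vec_bin(path):
    """Read the original word2vec binary format into {word: [float, ...]}."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocabSize> <vectorSize>\n"
        vocab_size, vector_size = map(int, f.readline().split())
        for _ in range(vocab_size):
            # Word: bytes up to the first space; skip newlines left over
            # from the previous entry.
            ch = f.read(1)
            while ch == b"\n":
                ch = f.read(1)
            chars = []
            while ch not in (b" ", b""):
                chars.append(ch)
                ch = f.read(1)
            word = b"".join(chars).decode("utf-8", errors="replace")
            # Vector: vector_size little-endian 4-byte floats.
            raw = f.read(4 * vector_size)
            vectors[word] = list(struct.unpack("<%df" % vector_size, raw))
    return vectors
{code}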
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651907#comment-14651907 ]

partha bishnu commented on SPARK-9559:
--------------------------------------

The expected behavior should be that the spark master on n-1 should restart the jobs with one new executor under the running worker JVM on the other worker node n-2, which is up and running after n-3 went down. Isn't that the expected behavior? But that does not happen. Thanks for your comments.

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651923#comment-14651923 ]

Sean Owen commented on SPARK-9559:
----------------------------------

OK, so you have requested 1 total executor. Did the job fail then? Or are you talking about the state after it completed?

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651924#comment-14651924 ]

Sean Owen commented on SPARK-9559:
----------------------------------

PS: you should try reproducing this on master rather than 1.3, which is relatively old at this stage.

Worker redundancy/failover in spark stand-alone mode
----------------------------------------------------
Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu
[jira] [Commented] (SPARK-9499) Possible file handle leak in spilling/sort code
[ https://issues.apache.org/jira/browse/SPARK-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651932#comment-14651932 ]

Herman van Hovell commented on SPARK-9499:
------------------------------------------

I have also tried {noformat}spark.shuffle.sort.bypassMergeThreshold=0{noformat}. It does improve on the current situation, but now crashes a bit further down the line. I'll attach another {noformat}lsof{noformat} dump.

Possible file handle leak in spilling/sort code
-----------------------------------------------
Key: SPARK-9499
URL: https://issues.apache.org/jira/browse/SPARK-9499
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Blocker
Attachments: perf_test4.scala

As reported by [~hvanhovell]. See SPARK-8850.

Hi, I am getting a "Too many open files" error since the unsafe mode is on. The same thing popped up when playing with unsafe before. The error is below:

{noformat}
15/07/30 23:37:29 WARN TaskSetManager: Lost task 2.0 in stage 33.0 (TID 2423, localhost): java.io.FileNotFoundException: /tmp/blockmgr-b3d3e14a-f313-4075-8082-7d97f012e35a/14/temp_shuffle_1cab42fa-dcb1-4114-ae53-1674446f9dac (Too many open files)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:111)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

I am currently working in local mode (which is probably the cause of the problem) using the following command line:

{noformat}
$SPARK_HOME/bin/spark-shell --master local[*] --driver-memory 14G --driver-library-path $HADOOP_NATIVE_LIB
{noformat}

The maximum number of files I can open is 1024 ({{ulimit -n}}). I have tried to run the same code with an increased limit, but this didn't work out.

Dump of all open files after a "Too many open files" error. The command used to make the dump:

{code}
lsof -c java open
{code}

The job starts crashing as soon as I start sorting 1000 rows for the 9th time (doing benchmarking). I guess files are left open after every benchmark? Is there a way to trigger the closing of files?
[jira] [Comment Edited] (SPARK-9499) Possible file handle leak in spilling/sort code
[ https://issues.apache.org/jira/browse/SPARK-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651932#comment-14651932 ]

Herman van Hovell edited comment on SPARK-9499 at 8/3/15 2:46 PM:
------------------------------------------------------------------

I have also tried {{spark.shuffle.sort.bypassMergeThreshold=0}}. It does improve on the current situation, but now crashes a bit further down the line. I'll attach another {{lsof}} dump.

was (Author: hvanhovell): the same comment, using {noformat} markup instead of {{ }}.

Possible file handle leak in spilling/sort code
-----------------------------------------------
Key: SPARK-9499
URL: https://issues.apache.org/jira/browse/SPARK-9499
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Blocker
Attachments: perf_test4.scala
[jira] [Updated] (SPARK-9499) Possible file handle leak in spilling/sort code
[ https://issues.apache.org/jira/browse/SPARK-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated SPARK-9499:
-------------------------------------
Attachment: open.files.II.txt

{{lsof}} dump with the {{spark.shuffle.sort.bypassMergeThreshold=0}} setting.

Possible file handle leak in spilling/sort code
-----------------------------------------------
Key: SPARK-9499
URL: https://issues.apache.org/jira/browse/SPARK-9499
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Blocker
Attachments: open.files.II.txt, perf_test4.scala
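For anyone reproducing this, the setting tested in the comments above can be applied when constructing the context. A minimal PySpark sketch (the key and value are exactly those from the comment; everything else is illustrative):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        # Force the sort-based shuffle path instead of bypass-merge,
        # as tried in the comment above.
        .set("spark.shuffle.sort.bypassMergeThreshold", "0"))

sc = SparkContext(conf=conf)
{code}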
[jira] [Created] (SPARK-9560) Add LDA data generator
yuhao yang created SPARK-9560:
------------------------------
Summary: Add LDA data generator
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Assigned] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9560:
-----------------------------------
Assignee: Apache Spark

Add LDA data generator
----------------------
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang
Assignee: Apache Spark

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Assigned] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9560:
-----------------------------------
Assignee: (was: Apache Spark)

Add LDA data generator
----------------------
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Commented] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652061#comment-14652061 ]

Apache Spark commented on SPARK-9560:
-------------------------------------

User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/7898

Add LDA data generator
----------------------
Key: SPARK-9560
URL: https://issues.apache.org/jira/browse/SPARK-9560
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: yuhao yang

Add a data generator for LDA. Hopefully it can help with performance improvements.
[jira] [Commented] (SPARK-9512) RemoveEvaluationFromSort reorders sort order
[ https://issues.apache.org/jira/browse/SPARK-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652620#comment-14652620 ]

Apache Spark commented on SPARK-9512:
-------------------------------------

User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7906

RemoveEvaluationFromSort reorders sort order
--------------------------------------------
Key: SPARK-9512
URL: https://issues.apache.org/jira/browse/SPARK-9512
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Blocker

Please refer to the comment in https://github.com/apache/spark/pull/7593 for details.
[jira] [Assigned] (SPARK-925) Allow ec2 scripts to load default options from a json file
[ https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-925:
----------------------------------
Assignee: Apache Spark

Allow ec2 scripts to load default options from a json file
-----------------------------------------------------------
Key: SPARK-925
URL: https://issues.apache.org/jira/browse/SPARK-925
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 0.8.0
Reporter: Shay Seng
Assignee: Apache Spark
Priority: Minor

The option list for the ec2 script can be a little irritating to type in, especially things like the path to the identity file, region, zone, AMI, etc. It would be nice if the ec2 script looked for an options.json file in the following order: (1) CWD, (2) ~/spark-ec2, (3) same dir as spark_ec2.py.

Something like:

{code}
import json
import os
import stat
import sys

def get_defaults_from_options():
    # Check to see if an options.json file exists; if so, load it.
    # However, values in the options.json file can only override values in opts
    # if the opt values are None or '' (i.e. command-line options take precedence).
    defaults = {'aws-access-key-id': '', 'aws-secret-access-key': '', 'key-pair': '',
                'identity-file': '', 'region': 'ap-southeast-1', 'zone': '',
                'ami': '', 'slaves': 1, 'instance-type': 'm1.large'}

    # Look for options.json in the directory the cluster was called from.
    # Had to modify the spark_ec2 wrapper script since it mangles the pwd.
    startwd = os.environ['STARTWD']
    if os.path.exists(os.path.join(startwd, "options.json")):
        optionspath = os.path.join(startwd, "options.json")
    else:
        optionspath = os.path.join(os.getcwd(), "options.json")

    try:
        print "Loading options file:", optionspath
        with open(optionspath) as json_data:
            jdata = json.load(json_data)
        for k in jdata:
            defaults[k] = jdata[k]
    except IOError:
        print "Warning: options.json file not loaded"

    # Check permissions on identity-file if defined; otherwise the launch will
    # fail late, which is irritating.
    if defaults['identity-file'] != '':
        st = os.stat(defaults['identity-file'])
        user_can_read = bool(st.st_mode & stat.S_IRUSR)
        grp_perms = bool(st.st_mode & stat.S_IRWXG)
        others_perm = bool(st.st_mode & stat.S_IRWXO)
        if not user_can_read:
            print "No read permission to read", defaults['identity-file']
            sys.exit(1)
        if grp_perms or others_perm:
            print "Permissions are too open, please chmod 600 file", defaults['identity-file']
            sys.exit(1)

    # If defaults contain an AWS access id or secret key, set them in the
    # environment; required for use with boto to access the AWS console.
    if defaults['aws-access-key-id'] != '':
        os.environ['AWS_ACCESS_KEY_ID'] = defaults['aws-access-key-id']
    if defaults['aws-secret-access-key'] != '':
        os.environ['AWS_SECRET_ACCESS_KEY'] = defaults['aws-secret-access-key']
    return defaults
{code}
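A matching options.json for the sketch above might look like this (all values illustrative; the keys mirror the defaults dict):

{code}
{
  "key-pair": "my-keypair",
  "identity-file": "/home/me/.ssh/my-keypair.pem",
  "region": "ap-southeast-1",
  "slaves": 2,
  "instance-type": "m1.large"
}
{code}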
[jira] [Commented] (SPARK-925) Allow ec2 scripts to load default options from a json file
[ https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652622#comment-14652622 ]

Apache Spark commented on SPARK-925:
------------------------------------

User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7906

Allow ec2 scripts to load default options from a json file
-----------------------------------------------------------
Key: SPARK-925
URL: https://issues.apache.org/jira/browse/SPARK-925
Project: Spark
Issue Type: Improvement
Components: EC2
Affects Versions: 0.8.0
Reporter: Shay Seng
Priority: Minor
[jira] [Updated] (SPARK-7165) Sort Merge Join for outer joins
[ https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7165:
-------------------------------
Sprint: Week 32

Sort Merge Join for outer joins
-------------------------------
Key: SPARK-7165
URL: https://issues.apache.org/jira/browse/SPARK-7165
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Adrian Wang
Assignee: Reynold Xin
Priority: Blocker
[jira] [Updated] (SPARK-7165) Sort Merge Join for outer joins
[ https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7165:
-------------------------------
Assignee: Josh Rosen (was: Reynold Xin)

Sort Merge Join for outer joins
-------------------------------
Key: SPARK-7165
URL: https://issues.apache.org/jira/browse/SPARK-7165
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Adrian Wang
Assignee: Josh Rosen
Priority: Blocker
[jira] [Updated] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-7799:
---------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
---------------------------------------------------------------------------------------------
Key: SPARK-7799
URL: https://issues.apache.org/jira/browse/SPARK-7799
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Shixiong Zhu

Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}}.
[jira] [Updated] (SPARK-4246) Add testsuite with end-to-end testing of driver failure
[ https://issues.apache.org/jira/browse/SPARK-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-4246:
---------------------------------
Target Version/s: (was: 1.5.0)

Add testsuite with end-to-end testing of driver failure
--------------------------------------------------------
Key: SPARK-4246
URL: https://issues.apache.org/jira/browse/SPARK-4246
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
[jira] [Commented] (SPARK-9131) Python UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652694#comment-14652694 ]

Davies Liu commented on SPARK-9131:
-----------------------------------

I think this may be fixed by https://github.com/apache/spark/pull/7131. [~luispeguerra] Could you help confirm whether it's fixed in master or not?

Python UDFs change data values
------------------------------
Key: SPARK-9131
URL: https://issues.apache.org/jira/browse/SPARK-9131
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Assignee: Davies Liu
Priority: Blocker
Attachments: testjson_jira9131.z01, testjson_jira9131.z02, testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, testjson_jira9131.z06, testjson_jira9131.zip

I am having some trouble when using a custom udf in dataframes with pyspark 1.4. I have rewritten the udf to simplify the problem, and it gets even weirder. The udfs I am using do absolutely nothing: they just receive some value and output the same value with the same format. I show you my code below:

{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(c['t2']).alias('tb'),
             udf_B(c['t1']).alias('tc'),
             udf_C(c['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}

I am showing here the results from the outputs:

{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}

The problem here is that values in columns 'tb', 'tc' and 'td' in dataframe 'd' are completely different from values 't1' and 't2' in dataframe 'c', even though my udfs do nothing. It seems as if values were somehow taken from other registers (or just invented). Results are different between executions (apparently random).

Thanks in advance
[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
[ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-7441:
---------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
------------------------------------------------------------------------------------------------------------------------------------------------
Key: SPARK-7441
URL: https://issues.apache.org/jira/browse/SPARK-7441
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Emre Sevinç
Labels: performance

Implement microbatch functionality so that Spark Streaming can process a huge backlog of existing files, discovered in batch, in smaller batches.

Spark Streaming can process already existing files in a directory, and depending on the value of {{spark.streaming.minRememberDuration}} (60 seconds by default; see SPARK-3276 for more details), this might mean that a Spark Streaming application can receive thousands, or hundreds of thousands, of files within the first batch interval. This, in turn, leads to something like a 'flooding' effect for the streaming application, which tries to deal with a huge number of existing files in a single batch interval.

We will propose a very simple change to {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a configuration property such as {{spark.streaming.microbatch.size}}, it will either keep its default behavior when {{spark.streaming.microbatch.size}} has the default value of {{0}} (meaning: as many files as have been discovered as new in the current batch interval), or process new files in groups of {{spark.streaming.microbatch.size}} (e.g. in groups of 100s).

We have tested this patch at one of our customers, and it has been running successfully for weeks (e.g. there were cases where our Spark Streaming application was stopped, and in the meantime tens of thousands of files were created in a directory, and our Spark Streaming application had to process those existing files after it was started).
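The core of the proposed selection logic fits in a few lines; a hedged, standalone Python illustration (not the actual FileInputDStream patch; the config key is the one the ticket proposes):

{code}
def files_for_this_batch(new_files, microbatch_size=0):
    """Cap how many newly discovered files enter a single batch interval.

    microbatch_size == 0 keeps today's behavior: every discovered file is
    processed in the current interval. A positive value processes the backlog
    in groups, leaving the remainder for later intervals.
    """
    if microbatch_size <= 0:
        return new_files, []
    return new_files[:microbatch_size], new_files[microbatch_size:]

# Example: a backlog of 250 files with spark.streaming.microbatch.size=100
batch, remaining = files_for_this_batch(["f%d" % i for i in range(250)], 100)
assert len(batch) == 100 and len(remaining) == 150
{code}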
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6116:
-------------------------------
Target Version/s: 1.5.0 (was: 1.6.0)

DataFrame API improvement umbrella ticket (Spark 1.5)
-----------------------------------------------------
Key: SPARK-6116
URL: https://issues.apache.org/jira/browse/SPARK-6116
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
Labels: DataFrame

An umbrella ticket to track improvements and changes needed to make the DataFrame API non-experimental.
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6116:
-------------------------------
Priority: Critical (was: Blocker)

DataFrame API improvement umbrella ticket (Spark 1.5)
-----------------------------------------------------
Key: SPARK-6116
URL: https://issues.apache.org/jira/browse/SPARK-6116
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
Labels: DataFrame

An umbrella ticket to track improvements and changes needed to make the DataFrame API non-experimental.
[jira] [Updated] (SPARK-9572) Add StreamingContext.getActiveOrCreate() to python API
[ https://issues.apache.org/jira/browse/SPARK-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-9572:
---------------------------------
Target Version/s: 1.4.2, 1.5.0 (was: 1.5.0)

Add StreamingContext.getActiveOrCreate() to python API
------------------------------------------------------
Key: SPARK-9572
URL: https://issues.apache.org/jira/browse/SPARK-9572
Project: Spark
Issue Type: Improvement
Components: PySpark, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
[jira] [Created] (SPARK-9579) Improve Word2Vec unit tests
Joseph K. Bradley created SPARK-9579:
-------------------------------------
Summary: Improve Word2Vec unit tests
Key: SPARK-9579
URL: https://issues.apache.org/jira/browse/SPARK-9579
Project: Spark
Issue Type: Test
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

Word2Vec unit tests should be improved in a few ways:
* Test individual components of the algorithm. This may mean breaking the code into smaller methods which can be tested individually.
* Test vs. another library, if possible. Following the example of the unit tests for LogisticRegression, create robust unit tests making sure the two implementations produce similar results. This may be too hard to do robustly (and deterministically); in that case, the first improvement will suffice.
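On the "test vs. another library" point: comparisons across Word2Vec implementations are usually made on derived quantities rather than raw vectors, since random initialization makes raw coordinates incomparable. A small, hedged Python helper of the kind such a test might use ({{top_k_by}} in the comment is a hypothetical helper, not an existing API):

{code}
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g., check that each implementation's nearest neighbors for a probe word
# overlap, rather than asserting vector equality:
# assert "queen" in top_k_by(cosine, spark_vecs, "king", k=10)
{code}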
[jira] [Updated] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
[ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9323:
-------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
----------------------------------------------------------------------------------------
Key: SPARK-9323
URL: https://issues.apache.org/jira/browse/SPARK-9323
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Josh Rosen

The following two queries should be equivalent, but the second crashes:

{code}
sqlContext.read.json(sqlContext.sparkContext.makeRDD(
    """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
  .registerTempTable("nestedOrder")
checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
{code}

Here's the stacktrace:

{code}
Cannot resolve column name "a.b" among (b);
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
	at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
	at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
{code}

Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}.

UPDATE: here's a shorter one-liner reproduction:

{code}
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
{code}
[jira] [Updated] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is a window function gives an analysis error
[ https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7659:
-------------------------------
Target Version/s: 1.6.0 (was: 1.5.0)

Sort by attributes that are not present in the SELECT clause when there is a window function gives an analysis error
---------------------------------------------------------------------------------------------------------------------
Key: SPARK-7659
URL: https://issues.apache.org/jira/browse/SPARK-7659
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang

The following SQL gets an analysis error: select month, sum(product) over (partition by month) from windowData order by area
[jira] [Assigned] (SPARK-7821) Hide private SQL JDBC classes from Javadoc
[ https://issues.apache.org/jira/browse/SPARK-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin reassigned SPARK-7821:
----------------------------------
Assignee: Reynold Xin

Hide private SQL JDBC classes from Javadoc
------------------------------------------
Key: SPARK-7821
URL: https://issues.apache.org/jira/browse/SPARK-7821
Project: Spark
Issue Type: Improvement
Components: Documentation, SQL
Reporter: Josh Rosen
Assignee: Reynold Xin

We should hide {{private\[sql\]}} JDBC classes from the generated Javadoc, since showing these internal classes can be confusing to users. This is especially important for the SQL {{jdbc}} package because it contains an internal JDBCRDD class which is easily confused with the public JdbcRDD class in Spark Core (see SPARK-7804 for an example of this).
[jira] [Resolved] (SPARK-9263) Add Spark Submit flag to exclude dependencies when using --packages
[ https://issues.apache.org/jira/browse/SPARK-9263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-9263.
-----------------------------------
Resolution: Fixed
Assignee: Burak Yavuz
Fix Version/s: 1.5.0

Add Spark Submit flag to exclude dependencies when using --packages
-------------------------------------------------------------------
Key: SPARK-9263
URL: https://issues.apache.org/jira/browse/SPARK-9263
Project: Spark
Issue Type: New Feature
Components: Spark Submit
Reporter: Burak Yavuz
Assignee: Burak Yavuz
Fix For: 1.5.0

While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or not available due to licensing).
[jira] [Assigned] (SPARK-9583) build/mvn script should not print debug messages to stdout
[ https://issues.apache.org/jira/browse/SPARK-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9583:
-----------------------------------
Assignee: Apache Spark

build/mvn script should not print debug messages to stdout
-----------------------------------------------------------
Key: SPARK-9583
URL: https://issues.apache.org/jira/browse/SPARK-9583
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Assignee: Apache Spark
Priority: Minor

Doing that means it cannot be used to run {{make-distribution.sh}}, which parses the stdout of maven commands.
[jira] [Assigned] (SPARK-9583) build/mvn script should not print debug messages to stdout
[ https://issues.apache.org/jira/browse/SPARK-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9583:
-----------------------------------
Assignee: (was: Apache Spark)

build/mvn script should not print debug messages to stdout
-----------------------------------------------------------
Key: SPARK-9583
URL: https://issues.apache.org/jira/browse/SPARK-9583
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Priority: Minor

Doing that means it cannot be used to run {{make-distribution.sh}}, which parses the stdout of maven commands.
[jira] [Updated] (SPARK-7542) Support off-heap sort buffer in UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7542:
-------------------------------
Issue Type: New Feature (was: Sub-task)
Parent: (was: SPARK-9457)

Support off-heap sort buffer in UnsafeExternalSorter
----------------------------------------------------
Key: SPARK-7542
URL: https://issues.apache.org/jira/browse/SPARK-7542
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen

{{UnsafeExternalSorter}}, introduced in SPARK-7081, uses on-heap {{long[]}} arrays as its sort buffers. When records are small, the sorting array might be as large as the data pages, so it would be useful to be able to allocate this array off-heap (using our unsafe LongArray). Unfortunately, we can't currently do this because TimSort calls {{allocate()}} to create data buffers but doesn't call any corresponding cleanup methods to free them. We should look into extending TimSort with buffer-freeing methods, then consider switching to LongArray in UnsafeShuffleSortDataFormat.
[jira] [Commented] (SPARK-9583) build/mvn script should not print debug messages to stdout
[ https://issues.apache.org/jira/browse/SPARK-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652880#comment-14652880 ]

Apache Spark commented on SPARK-9583:
-------------------------------------

User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7915

build/mvn script should not print debug messages to stdout
-----------------------------------------------------------
Key: SPARK-9583
URL: https://issues.apache.org/jira/browse/SPARK-9583
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Priority: Minor

Doing that means it cannot be used to run {{make-distribution.sh}}, which parses the stdout of maven commands.
[jira] [Resolved] (SPARK-9457) Sorting improvements
[ https://issues.apache.org/jira/browse/SPARK-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-9457.
--------------------------------
Resolution: Fixed
Assignee: Reynold Xin
Fix Version/s: 1.5.0

Sorting improvements
--------------------
Key: SPARK-9457
URL: https://issues.apache.org/jira/browse/SPARK-9457
Project: Spark
Issue Type: Umbrella
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Fix For: 1.5.0

An umbrella ticket to improve sorting in Tungsten.
[jira] [Assigned] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
[ https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9585:
-----------------------------------
Assignee: Apache Spark

HiveHBaseTableInputFormat can't be cached
-----------------------------------------
Key: SPARK-9585
URL: https://issues.apache.org/jira/browse/SPARK-9585
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula
Assignee: Apache Spark

The exception below occurs in the Spark-on-HBase function:

{quote}
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451]
{quote}

When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat. But HiveHBaseTableInputFormat is not thread-safe, so I think we should add a config to control whether the InputFormat is cached.
[jira] [Assigned] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
[ https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9585: --- Assignee: (was: Apache Spark) HiveHBaseTableInputFormat can't be cached --- Key: SPARK-9585 URL: https://issues.apache.org/jira/browse/SPARK-9585 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
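As a sketch of the suggested config (all names here are assumptions, including the flag itself), the record-reader path could bypass the JVM-wide InputFormat cache when caching is disabled, so each task gets its own instance of a non-thread-safe format:
{code}
import org.apache.hadoop.mapred.{InputFormat, JobConf}
import org.apache.hadoop.util.ReflectionUtils
import scala.collection.mutable

// Hypothetical helper: honor a flag such as "spark.hadoop.cacheInputFormat"
// (assumed name) instead of always reusing one shared InputFormat instance.
object InputFormats {
  private val cache = mutable.Map.empty[Class[_], InputFormat[_, _]]

  def get(conf: JobConf, cls: Class[_ <: InputFormat[_, _]]): InputFormat[_, _] = {
    val useCache = conf.getBoolean("spark.hadoop.cacheInputFormat", true)
    if (useCache) {
      cache.synchronized {
        cache.getOrElseUpdate(cls, ReflectionUtils.newInstance(cls, conf))
      }
    } else {
      // Fresh instance per caller: safe for formats like HiveHBaseTableInputFormat.
      ReflectionUtils.newInstance(cls, conf)
    }
  }
}
{code}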
[jira] [Resolved] (SPARK-8064) Upgrade Hive to 1.2
[ https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8064. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7191 [https://github.com/apache/spark/pull/7191] Upgrade Hive to 1.2 --- Key: SPARK-8064 URL: https://issues.apache.org/jira/browse/SPARK-8064 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Steve Loughran Priority: Blocker Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652701#comment-14652701 ] Apache Spark commented on SPARK-9516: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/7910 Improve Thread Dump page Key: SPARK-9516 URL: https://issues.apache.org/jira/browse/SPARK-9516 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Nan Zhu Originally proposed by [~irashid] in https://github.com/apache/spark/pull/7808#issuecomment-126788335: we can enhance the current thread dump page with at least the following two new features: 1) sort threads by thread status, 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2870: --- Parent Issue: SPARK-9576 (was: SPARK-6116) Thorough schema inference directly on RDDs of Python dictionaries - Key: SPARK-2870 URL: https://issues.apache.org/jira/browse/SPARK-2870 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas h4. Background I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set. This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example:
{code}
{"a": 5}
{"a": "cow"}
{code}
To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well. h4. Feature Request What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
{code}
SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
{code}
As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable. h4. Example Use Case * You have some JSON text data that you want to analyze using Spark SQL. * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation. * You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries. * From this RDD, you want a SchemaRDD that captures the schema for the whole data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
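Conceptually the request is a fold over per-element schemas rather than a peek at the first element. Sketched generically in Scala (the helper and its signature are assumptions, not an existing API):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.StructType

// Illustrative: infer a schema for every element, then merge pairwise with a
// caller-supplied widening rule (e.g. IntegerType vs. StringType widens to
// StringType), so the result covers the whole data set.
def inferFullSchema[T](rdd: RDD[T])(inferOne: T => StructType)
                      (merge: (StructType, StructType) => StructType): StructType =
  rdd.map(inferOne).reduce(merge)
{code}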
[jira] [Updated] (SPARK-9392) Dataframe drop should work on unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9392: --- Parent Issue: SPARK-9576 (was: SPARK-6116) Dataframe drop should work on unresolved columns Key: SPARK-9392 URL: https://issues.apache.org/jira/browse/SPARK-9392 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical i.e. I would expect `df.drop($"colName")` to work. Another example is the test case here: https://github.com/apache/spark/pull/6585/files#diff-5d2ebf4e9ca5a990136b276859769289R355 which I would expect to not be a no-op. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data
[ https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8000: --- Parent Issue: SPARK-9576 (was: SPARK-6116) SQLContext.read.load() should be able to auto-detect input data --- Key: SPARK-8000 URL: https://issues.apache.org/jira/browse/SPARK-8000 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin If it is a Parquet file, use Parquet. If it is a JSON file, use JSON. If it is an ORC file, use ORC. If it is a CSV file, use CSV. Maybe Spark SQL can also write an output metadata file to specify the schema and data source that's used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
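A rough sketch of the idea, assuming a simple extension-based heuristic (the helper below is illustrative, not Spark's implementation):
{code}
// Illustrative only: map a file extension to a data source name.
def detectFormat(path: String): String =
  path.substring(path.lastIndexOf('.') + 1).toLowerCase match {
    case "parquet" => "parquet"
    case "json"    => "json"
    case "orc"     => "orc"
    case "csv"     => "csv"
    case other     => sys.error(s"Cannot auto-detect data source for extension .$other")
  }

// e.g. sqlContext.read.format(detectFormat("events.json")).load("events.json")
{code}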
[jira] [Commented] (SPARK-7160) Support converting DataFrames to typed RDDs.
[ https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652851#comment-14652851 ] Michael Armbrust commented on SPARK-7160: - I spent about an hour trying to fix conflicts and get the tests to pass, but unfortunately I think this is going to miss the release as a lot of stuff has changed now that we are using {{InternalRow}}. This would be a really good feature to have, so we should sync up around the beginning of 1.6 if you have time to update it, [~rayortigas], and we can make sure to merge it quickly so conflicts don't accumulate again. Support converting DataFrames to typed RDDs. Key: SPARK-7160 URL: https://issues.apache.org/jira/browse/SPARK-7160 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.1 Reporter: Ray Ortigas Assignee: Ray Ortigas Priority: Critical As a Spark user still working with RDDs, I'd like the ability to convert a DataFrame to a typed RDD. For example, if I've converted RDDs to DataFrames so that I could save them as Parquet or CSV files, I would like to rebuild the RDD from those files automatically rather than writing the row-to-type conversion myself.
{code}
val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
val df0 = rdd0.toDF()
df0.save("foods.parquet")
val df1 = sqlContext.load("foods.parquet")
val rdd1 = df1.toTypedRDD[Food]()
// rdd0 and rdd1 should have the same elements
{code}
I originally submitted a smaller PR for spark-csv https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested that converting a DataFrame to a typed RDD wasn't something specific to spark-csv. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
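For reference, the manual row-to-type conversion that {{toTypedRDD}} would automate looks roughly like this today (a sketch; the {{Food}} field names below are assumed, not given in the example above):
{code}
case class Food(name: String, count: Int)

// Hand-written equivalent of df1.toTypedRDD[Food]():
val rdd1 = df1.rdd.map { row =>
  Food(row.getAs[String]("name"), row.getAs[Int]("count"))
}
{code}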
[jira] [Created] (SPARK-9580) Refactor TestSQLContext to make it non-singleton
Andrew Or created SPARK-9580: Summary: Refactor TestSQLContext to make it non-singleton Key: SPARK-9580 URL: https://issues.apache.org/jira/browse/SPARK-9580 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Because the TestSQLContext is a singleton object, there is literally no way to start a SparkContext in the SQL tests since we disallow multiple SparkContexts in the same JVM. Starting a custom SparkContext is useful when we want to run Spark in local-cluster mode or enable the UI, which is normally disabled. This is a blocker for 1.5 because we currently have tests entirely commented out due to this limitation. https://github.com/apache/spark/blob/7abaaad5b169520fbf7299808b2bafde089a16a2/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
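One possible shape for the refactoring (a sketch under assumed names, not the actual patch): make the test context an ordinary class and wrap custom-SparkContext tests in a loan-pattern helper so the context is always stopped.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical: a per-suite context instead of a singleton object.
class LocalSQLContext(sc: SparkContext) extends SQLContext(sc)

def withSQLContext(conf: SparkConf)(body: SQLContext => Unit): Unit = {
  val sc = new SparkContext(conf)
  try body(new LocalSQLContext(sc)) finally sc.stop()
}

// Usage, e.g. for local-cluster mode:
// withSQLContext(new SparkConf().setMaster("local-cluster[2,1,1024]").setAppName("test")) { ctx => ... }
{code}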
[jira] [Created] (SPARK-9582) Improve clarity of LocalLDAModel log likelihood methods
Joseph K. Bradley created SPARK-9582: Summary: Improve clarity of LocalLDAModel log likelihood methods Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
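For context on the distinction (this is the standard LDA decomposition, not taken from the ticket): with inferred topics \(\beta\), document-topic prior \(\alpha\), and topic prior \(\eta\), the joint log likelihood splits into a data term and a topic term, and the two methods differ in whether the second term is included.
{code}
\log p(\text{docs}, \beta \mid \alpha, \eta)
  = \underbrace{\log p(\text{docs} \mid \beta, \alpha)}_{\text{data term}}
  + \underbrace{\log p(\beta \mid \eta)}_{\text{log likelihood of the inferred topics}}
{code}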
[jira] [Reopened] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early
[ https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-9372: I reverted the merged patch since it had a few problems. For a join operator, rows with null equal join key expression can be filtered out early --- Key: SPARK-9372 URL: https://issues.apache.org/jira/browse/SPARK-9372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can filter out rows that have null values for column A.key/B.key, because those rows cannot contribute to the join output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
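In DataFrame terms the optimization amounts to the following (a sketch of the effect, not the optimizer rule itself; {{dfA}} and {{dfB}} stand in for the two tables):
{code}
// Rows whose equi-join key is NULL can never satisfy A.key = B.key,
// so they can be dropped before the join runs.
import sqlContext.implicits._

val a = dfA.filter($"key".isNotNull)
val b = dfB.filter($"key".isNotNull)
val joined = a.join(b, a("key") === b("key"))  // same result as joining dfA and dfB directly
{code}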
[jira] [Updated] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early
[ https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9372: --- Target Version/s: 1.6.0 For a join operator, rows with null equal join key expression can be filtered out early --- Key: SPARK-9372 URL: https://issues.apache.org/jira/browse/SPARK-9372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can filter out rows that have null values for column A.key/B.key, because those rows cannot contribute to the join output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early
[ https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9372: --- Fix Version/s: (was: 1.5.0) For a join operator, rows with null equal join key expression can be filtered out early --- Key: SPARK-9372 URL: https://issues.apache.org/jira/browse/SPARK-9372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can filter out rows that have null values for column A.key/B.key, because those rows cannot contribute to the join output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9575: --- Assignee: Apache Spark Add documentation around Mesos shuffle service and dynamic allocation - Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652634#comment-14652634 ] Apache Spark commented on SPARK-9575: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/7907 Add documentation around Mesos shuffle service and dynamic allocation - Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9575: --- Assignee: (was: Apache Spark) Add documentation around Mesos shuffle service and dynamic allocation - Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9575) Add documentation around Mesos shuffle service and dynamic allocation
Timothy Chen created SPARK-9575: --- Summary: Add documentation around Mesos shuffle service and dynamic allocation Key: SPARK-9575 URL: https://issues.apache.org/jira/browse/SPARK-9575 Project: Spark Issue Type: Documentation Components: Mesos Reporter: Timothy Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7791) Set user for executors in standalone-mode
[ https://issues.apache.org/jira/browse/SPARK-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652660#comment-14652660 ] Niels Becker commented on SPARK-7791: - We ended up using your workaround. But since our Spark slaves are running inside Docker containers and GlusterFS is mounted on the host machine, we were able to mount only the corresponding user folders into the Docker container by setting {{spark.mesos.executor.docker.volumes}}. This way Spark is not able to write to other users' folders. Set user for executors in standalone-mode - Key: SPARK-7791 URL: https://issues.apache.org/jira/browse/SPARK-7791 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Tomasz Früboes I'm opening this following a discussion in https://www.mail-archive.com/user@spark.apache.org/msg28633.html Our setup was as follows. Spark (1.3.1, prebuilt for hadoop 2.6, also 2.4) was installed in standalone mode and started manually from the root account. Everything worked properly apart from operations such as {{rdd.saveAsPickleFile(ofile)}}, which ended with the exception:
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
{code}
(files created in _temporary were owned by user root). It would be great if Spark could set the user for the executors in standalone mode as well. Setting SPARK_USER has no effect here. BTW it may be a good idea to add some warning (e.g. during Spark startup) that running from the root account is not a very healthy idea. E.g. mapping this function
{code}
def test(x):
    f = open('/etc/testTMF.txt', 'w')
    return 0
{code}
over an RDD creates a file in /etc/ (surprisingly, calls like f.write(text) end with an exception) Thanks, Tomasz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9228: --- Assignee: Michael Armbrust (was: Apache Spark) Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9482) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
[ https://issues.apache.org/jira/browse/SPARK-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652659#comment-14652659 ] Davies Liu commented on SPARK-9482: --- The Physical plan looks very strange, it use unsafe BroadcastHashOuterJoin and BroadcastLeftSemiJoinHash, but use safe Projection (which should be TungstenProjection). I tried locally, it does use TungstenProject. Is it possible that conf.unsafeEnabled is flaky? (changed by some tests) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin --- Key: SPARK-9482 URL: https://issues.apache.org/jira/browse/SPARK-9482 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Yin Huai Priority: Blocker Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39059/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/semijoin/ {code} Regression org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin Failing for the past 1 build (Since Failed#39059 ) Took 7.7 sec. Error Message Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key))'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b)MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Project [key#176228] Join LeftSemi, Some((key#176228 = key#176230)) Project [key#176228] MetastoreRelation default, t3, Some(a) Project [key#176230] MetastoreRelation default, t2, Some(b)Project [key#176232] MetastoreRelation default, t1, Some(c) == Physical Plan == ExternalSort [key#176228 ASC], false Project [key#176228] ConvertToSafe BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None ConvertToUnsafe Project [key#176228] ConvertToSafe BroadcastLeftSemiJoinHash [key#176228], [key#176230], None ConvertToUnsafe HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a)) ConvertToUnsafe HiveTableScan [key#176230], (MetastoreRelation default, t2, Some(b)) ConvertToUnsafe HiveTableScan [key#176232], (MetastoreRelation default, t1, Some(c)) Code Generation: true == RDD == key !== HIVE - 31 row(s) == == CATALYST - 30 row(s) == 00 00 0 0 00 00 0 0 00 00 00 00 0 0 00 00 0 0 00 00 0 0 00 10 10 10 10 10 10 10 10 !48 !48 !8 NULL !8NULL NULL NULL NULL NULL NULL NULL NULL NULL !NULL Stacktrace sbt.ForkMain$ForkError: Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key)) 'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b) MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false 
Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Project [key#176228] Join LeftSemi, Some((key#176228 =
[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652661#comment-14652661 ] Apache Spark commented on SPARK-9228: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7908 Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9516: --- Assignee: (was: Apache Spark) Improve Thread Dump page Key: SPARK-9516 URL: https://issues.apache.org/jira/browse/SPARK-9516 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Nan Zhu Originally proposed by [~irashid] in https://github.com/apache/spark/pull/7808#issuecomment-126788335: we can enhance the current thread dump page with at least the following two new features: 1) sort threads by thread status, 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9516: --- Assignee: Apache Spark Improve Thread Dump page Key: SPARK-9516 URL: https://issues.apache.org/jira/browse/SPARK-9516 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Nan Zhu Assignee: Apache Spark Originally proposed by [~irashid] in https://github.com/apache/spark/pull/7808#issuecomment-126788335: we can enhance the current thread dump page with at least the following two new features: 1) sort threads by thread status, 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9054) Rename RowOrdering to InterpretedOrdering and use newOrdering to build orderings
[ https://issues.apache.org/jira/browse/SPARK-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-9054. -- Resolution: Won't Fix [~joshrosen] closing this as won't fix for now. We can reopen later if needed. Rename RowOrdering to InterpretedOrdering and use newOrdering to build orderings Key: SPARK-9054 URL: https://issues.apache.org/jira/browse/SPARK-9054 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen There are a few places where we still manually construct RowOrdering instead of using SparkPlan.newOrdering. We should update these to use newOrdering and should rename RowOrdering to InterpretedOrdering to make its function slightly more obvious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9482) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
[ https://issues.apache.org/jira/browse/SPARK-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9482: --- Sprint: Week 32 flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin --- Key: SPARK-9482 URL: https://issues.apache.org/jira/browse/SPARK-9482 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Yin Huai Priority: Blocker Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39059/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/semijoin/ {code} Regression org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin Failing for the past 1 build (Since Failed#39059 ) Took 7.7 sec. Error Message Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key))'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b)MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232))Project [key#176228] Join LeftSemi, Some((key#176228 = key#176230)) Project [key#176228] MetastoreRelation default, t3, Some(a) Project [key#176230] MetastoreRelation default, t2, Some(b)Project [key#176232] MetastoreRelation default, t1, Some(c) == Physical Plan == ExternalSort [key#176228 ASC], false Project [key#176228] ConvertToSafe BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None ConvertToUnsafe Project [key#176228] ConvertToSafe BroadcastLeftSemiJoinHash [key#176228], [key#176230], None ConvertToUnsafe HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a)) ConvertToUnsafe HiveTableScan [key#176230], (MetastoreRelation default, t2, Some(b)) ConvertToUnsafe HiveTableScan [key#176232], (MetastoreRelation default, t1, Some(c)) Code Generation: true == RDD == key !== HIVE - 31 row(s) == == CATALYST - 30 row(s) == 00 00 0 0 00 00 0 0 00 00 00 00 0 0 00 00 0 0 00 00 0 0 00 10 10 10 10 10 10 10 10 !48 !48 !8 NULL !8NULL NULL NULL NULL NULL NULL NULL NULL NULL !NULL Stacktrace sbt.ForkMain$ForkError: Results do not match for semijoin: == Parsed Logical Plan == 'Sort ['a.key ASC], false 'Project [unresolvedalias('a.key)] 'Join RightOuter, Some(('a.key = 'c.key)) 'Join LeftSemi, Some(('a.key = 'b.key)) 'UnresolvedRelation [t3], Some(a) 'UnresolvedRelation [t2], Some(b) 'UnresolvedRelation [t1], Some(c) == Analyzed Logical Plan == key: int Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Join LeftSemi, Some((key#176228 = key#176230)) MetastoreRelation default, t3, Some(a) MetastoreRelation default, t2, Some(b) MetastoreRelation default, t1, Some(c) == Optimized Logical Plan == Sort [key#176228 ASC], false Project [key#176228] Join RightOuter, Some((key#176228 = key#176232)) Project [key#176228] Join LeftSemi, Some((key#176228 = key#176230)) Project [key#176228] MetastoreRelation default, t3, Some(a) Project [key#176230] MetastoreRelation default, t2, Some(b) Project [key#176232] MetastoreRelation default, t1, Some(c) == 
Physical Plan == ExternalSort [key#176228 ASC], false Project [key#176228] ConvertToSafe
[jira] [Commented] (SPARK-8064) Upgrade Hive to 1.2
[ https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652730#comment-14652730 ] Steve Loughran commented on SPARK-8064: --- Also: we had to produce a custom release of hive-exec 1.2.1 with # the same version of Kryo as that used in Chill (2.21) # protobuf shaded (needed to co-exist with protobuf 2.4 on Hadoop 1.x) The source for this is at https://github.com/pwendell/hive/tree/release-1.2.1-spark Upgrade Hive to 1.2 --- Key: SPARK-8064 URL: https://issues.apache.org/jira/browse/SPARK-8064 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Steve Loughran Priority: Blocker Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9578) Stemmer feature transformer
Joseph K. Bradley created SPARK-9578: Summary: Stemmer feature transformer Key: SPARK-9578 URL: https://issues.apache.org/jira/browse/SPARK-9578 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Priority: Minor Transformer mentioned first in [SPARK-5571] based on a suggestion from [~aloknsingh]. Very standard NLP preprocessing task. From [~aloknsingh]: {quote} We have one Scala stemmer in scalanlp%chalk https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze which can easily be copied (as it is an Apache-licensed project) and is in Scala too. I think this will be a better alternative than Lucene's EnglishAnalyzer or OpenNLP. Note: we already use scalanlp%breeze via the Maven dependency, so I think adding a scalanlp%chalk dependency is also an option. But as you said, we can copy the code as it is small. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652787#comment-14652787 ] Joseph K. Bradley commented on SPARK-5571: -- The stopwords transformer made it for 1.5, but the stemmer will need to be in 1.6. Just linked them. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * a runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * a dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns
[ https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8887: -- Target Version/s: 1.5.0 (was: 1.6.0) Explicitly define which data types can be used as dynamic partition columns --- Key: SPARK-8887 URL: https://issues.apache.org/jira/browse/SPARK-8887 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic partitioning insertion, which uses {{String.valueOf}} to encode partition column values into dynamic partition directories. This actually limits the data types that can be used as partition columns. For example, the string representation of {{StructType}} values is not well defined. However, this limitation is not explicitly enforced. There are several things we can improve: # Enforce dynamic partition column data type requirements by adding analysis rules that throw {{AnalysisException}} when a violation occurs. # Abstract away the string representation of various data types, so that we don't need to convert internal representation types (e.g. {{UTF8String}}) to external types (e.g. {{String}}). A set of Hive-compatible implementations should be provided to ensure compatibility with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns
[ https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-8887: - Assignee: Cheng Lian Explicitly define which data types can be used as dynamic partition columns --- Key: SPARK-8887 URL: https://issues.apache.org/jira/browse/SPARK-8887 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic partitioning insertion, which uses {{String.valueOf}} to encode partition column values into dynamic partition directories. This actually limits the data types that can be used as partition columns. For example, the string representation of {{StructType}} values is not well defined. However, this limitation is not explicitly enforced. There are several things we can improve: # Enforce dynamic partition column data type requirements by adding analysis rules that throw {{AnalysisException}} when a violation occurs. # Abstract away the string representation of various data types, so that we don't need to convert internal representation types (e.g. {{UTF8String}}) to external types (e.g. {{String}}). A set of Hive-compatible implementations should be provided to ensure compatibility with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
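A sketch of the second improvement above (the helper and its cases are assumptions; only the null convention is Hive's real one): dispatch on the external data type instead of calling {{String.valueOf}} on internal values.
{code}
// Illustrative partition-value encoder. "__HIVE_DEFAULT_PARTITION__" is Hive's
// actual convention for null partition values; everything else here is a sketch.
def partitionValueString(value: Any): String = value match {
  case null                => "__HIVE_DEFAULT_PARTITION__"
  case s: String           => s
  case n: java.lang.Number => n.toString
  case d: java.sql.Date    => d.toString
  case other               => throw new IllegalArgumentException(
    s"Unsupported dynamic partition column value: $other")
}
{code}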
[jira] [Updated] (SPARK-9257) Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput
[ https://issues.apache.org/jira/browse/SPARK-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9257: --- Assignee: Yin Huai Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput -- Key: SPARK-9257 URL: https://issues.apache.org/jira/browse/SPARK-9257 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Minor
{code}
sqlContext.sql(
  """
    |SELECT sum(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
Aggregate2Sort Some(List(key#510)), [key#510], [(sum(CAST(value#511, LongType))2,mode=Final,isDistinct=false)], [sum(CAST(value#511, LongType))#1435L], [sum(CAST(value#511, LongType))#1435L AS _c0#1426L]
 ExternalSort [key#510 ASC], false
  Exchange hashpartitioning(key#510)
   Aggregate2Sort None, [key#510], [(sum(CAST(value#511, LongType))2,mode=Partial,isDistinct=false)], [currentSum#1433L], [key#510,currentSum#1433L]
    ExternalSort [key#510 ASC], false
     PhysicalRDD [key#510,value#511], MapPartitionsRDD[97] at apply at Transformer.scala:22

sqlContext.sql(
  """
    |SELECT sum(distinct value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
!FinalAndCompleteAggregate2Sort [key#510,CAST(value#511, LongType)#1446L], [key#510], [(sum(CAST(value#511, LongType)#1446L)2,mode=Complete,isDistinct=false)], [sum(CAST(value#511, LongType))#1445L], [sum(CAST(value#511, LongType))#1445L AS _c0#1438L]
 Aggregate2Sort Some(List(key#510)), [key#510,CAST(value#511, LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
  ExternalSort [key#510 ASC,CAST(value#511, LongType)#1446L ASC], false
   Exchange hashpartitioning(key#510)
    !Aggregate2Sort None, [key#510,CAST(value#511, LongType) AS CAST(value#511, LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
     ExternalSort [key#510 ASC,CAST(value#511, LongType) AS CAST(value#511, LongType)#1446L ASC], false
      PhysicalRDD [key#510,value#511], MapPartitionsRDD[102] at apply at Transformer.scala:22
{code}
For the examples shown above, you can see there is a {{!}} at the beginning of an operator's {{simpleString}}, which indicates that its {{missingInput}} is not empty. Actually, it is a false negative and we need to fix it. Also, it would be good to make these two operators' {{simpleString}} more reader-friendly (so people can tell what the grouping expressions are, what the aggregate functions are, and what the mode of an aggregate function is). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9251) do not order by expressions which still need evaluation
[ https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652675#comment-14652675 ] Apache Spark commented on SPARK-9251: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7906 do not order by expressions which still need evaluation --- Key: SPARK-9251 URL: https://issues.apache.org/jira/browse/SPARK-9251 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9513) Create Python API for all SQL functions
[ https://issues.apache.org/jira/browse/SPARK-9513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-9513: - Assignee: Davies Liu Create Python API for all SQL functions --- Key: SPARK-9513 URL: https://issues.apache.org/jira/browse/SPARK-9513 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Check all the SQL functions and make sure they have a Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6116) DataFrame API improvement umbrella ticket (Spark 1.5)
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6116: --- Summary: DataFrame API improvement umbrella ticket (Spark 1.5) (was: DataFrame API improvement umbrella ticket) DataFrame API improvement umbrella ticket (Spark 1.5) - Key: SPARK-6116 URL: https://issues.apache.org/jira/browse/SPARK-6116 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Labels: DataFrame An umbrella ticket to track improvements and changes needed to make DataFrame API non-experimental. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8416) Thread dump page should highlight Spark executor threads
[ https://issues.apache.org/jira/browse/SPARK-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8416. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7808 [https://github.com/apache/spark/pull/7808] Thread dump page should highlight Spark executor threads Key: SPARK-8416 URL: https://issues.apache.org/jira/browse/SPARK-8416 Project: Spark Issue Type: Bug Components: Web UI Reporter: Josh Rosen Fix For: 1.5.0 On the Spark thread dump page, it's hard to pick out executor threads from other system threads. The UI should employ some color coding or highlighting to make this more apparent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8466) Bug in SQL Optimizer: Unresolved Attribute after pushing Filter below Project
[ https://issues.apache.org/jira/browse/SPARK-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8466: Description: Input Data: a parquet file stored in hdfs:///data with two columns (lifeAverageBitrateKbps int, playtimems int) === Scripts used in spark-shell:
{code}
val df = sqlContext.parquetFile("hdfs:///data")
import org.apache.spark.sql.types._
val cols = df.schema.fields.map { f =>
  val dataType = f.dataType match {
    case DoubleType | FloatType => DecimalType.Unlimited
    case t => t
  }
  df.col(f.name).cast(dataType).as(f.name)
}
df.select(cols: _*).registerTempTable("t")
val query = """
  |SELECT avg(cleanedplaytimems),
  |       count(1)
  |FROM
  |  (SELECT 0 key,
  |          avg(lifeAverageBitrateKbps) avgbitrate
  |   FROM anon_sdm2_ss
  |   WHERE lifeAverageBitrateKbps > 0) t1,
  |  (SELECT 0 key,
  |          lifeAverageBitrateKbps,
  |          if(playtimems > 0, playtimems, 0) cleanedplaytimems
  |   FROM anon_sdm2_ss
  |   WHERE lifeAverageBitrateKbps > 0) t2
  |WHERE t1.key=t2.key
  |  AND t2.lifeAverageBitrateKbps > 0.5 * t1.avgbitrate
""".stripMargin
sqlContext.sql(query).explain(true)
{code}
=== Output:
{code}
== Analyzed Logical Plan ==
Aggregate [], [AVG(CAST(cleanedplaytimems#110, LongType)) AS _c0#111,COUNT(1) AS _c1#112L]
 Filter ((key#107 = key#109) && (CAST(lifeAverageBitrateKbps#105, DoubleType) > (0.5 * avgbitrate#108)))
  Join Inner, None
   Subquery t1
    Aggregate [], [0 AS key#107,AVG(CAST(lifeAverageBitrateKbps#105, LongType)) AS avgbitrate#108]
     Filter (lifeAverageBitrateKbps#105 > 0)
      Subquery anon_sdm2_ss
       Project [CAST(lifeaveragebitratekbps#27, IntegerType) AS lifeaveragebitratekbps#105,CAST(playtimems#89, IntegerType) AS playtimems#106]
        Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)
   Subquery t2
    Project [0 AS key#109,lifeAverageBitrateKbps#105,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((playtimems#106 > 0),playtimems#106,0) AS cleanedplaytimems#110]
     Filter (lifeAverageBitrateKbps#105 > 0)
      Subquery anon_sdm2_ss
       Project [CAST(lifeaveragebitratekbps#27, IntegerType) AS lifeaveragebitratekbps#105,CAST(playtimems#89, IntegerType) AS playtimems#106]
        Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)

== Optimized Logical Plan ==
Aggregate [], [AVG(CAST(cleanedplaytimems#110, LongType)) AS _c0#111,COUNT(1) AS _c1#112L]
 Project [cleanedplaytimems#110]
  Join Inner, Some(((key#107 = key#109) && (CAST(lifeAverageBitrateKbps#105, DoubleType) > (0.5 * avgbitrate#108))))
   Aggregate [], [0 AS key#107,AVG(CAST(lifeAverageBitrateKbps#105, LongType)) AS avgbitrate#108]
    Project [lifeaveragebitratekbps#27 AS lifeaveragebitratekbps#105]
     !Filter (lifeAverageBitrateKbps#105 > 0)
      Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)
   Project [0 AS key#109,lifeaveragebitratekbps#27 AS lifeaveragebitratekbps#105,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((playtimems#89 AS playtimems#106 > 0),playtimems#89 AS playtimems#106,0) AS cleanedplaytimems#110]
    !Filter (lifeAverageBitrateKbps#105 > 0)
     Relation[lifeaveragebitratekbps#27,playtimems#89] ParquetRelation2(WrappedArray(hdfs:///data),Map(),None,None)
{code}
Note: Filter is unresolved
[jira] [Updated] (SPARK-9582) LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9582: - Description: Small cleanups to LDA code and recent additions CC: [~fliang] was: LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] LDA cleanups Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Small cleanups to LDA code and recent additions CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652933#comment-14652933 ] partha bishnu commented on SPARK-9559: -- We tested on 1.4.1 and got the same results, i.e. a new executor JVM did not get started on the other worker node after the node running the jobs stopped running. So it seems like a major defect. Worker redundancy/failover in spark stand-alone mode Key: SPARK-9559 URL: https://issues.apache.org/jira/browse/SPARK-9559 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: partha bishnu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9582) LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9582: - Priority: Minor (was: Major) LDA cleanups Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9582) LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9582: - Summary: LDA cleanups (was: Improve clarity of LocalLDAModel log likelihood methods) LDA cleanups Key: SPARK-9582 URL: https://issues.apache.org/jira/browse/SPARK-9582 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley LocalLDAModel.logLikelihood resembles that for gensim, but it is not analogous to DistributedLDAModel.likelihood. The former includes the log likelihood of the inferred topics, but the latter does not. This JIRA is for refactoring the former to separate out the log likelihood of the inferred topics. CC: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9584) HiveHBaseTableInputFormat can't be cached
meiyoula created SPARK-9584: --- Summary: HiveHBaseTableInputFormat can't be cached Key: SPARK-9584 URL: https://issues.apache.org/jira/browse/SPARK-9584 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
[ https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652941#comment-14652941 ] Apache Spark commented on SPARK-9585: - User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/7918 HiveHBaseTableInputFormat can't be cached --- Key: SPARK-9585 URL: https://issues.apache.org/jira/browse/SPARK-9585 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9585) HiveHBaseTableInputFormat can't be cached
meiyoula created SPARK-9585: --- Summary: HiveHBaseTableInputFormat can't be cached Key: SPARK-9585 URL: https://issues.apache.org/jira/browse/SPARK-9585 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula The exception below occurs in the Spark on HBase function. {quote} java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577 rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451] {quote} When an executor has many cores, tasks belonging to the same RDD will use the same InputFormat, but HiveHBaseTableInputFormat is not thread-safe. So I think we should add a config that controls whether the InputFormat is cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9228: --- Assignee: Apache Spark (was: Michael Armbrust) Combine unsafe and codegen into a single option --- Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7505: --- Target Version/s: 1.5.0 (was: 1.6.0) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc. Key: SPARK-7505 URL: https://issues.apache.org/jira/browse/SPARK-7505 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark, SQL Affects Versions: 1.3.1 Reporter: Nicholas Chammas Priority: Minor The PySpark docs for DataFrame need the following fixes and improvements: # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over {{\_\_getattr\_\_}} and change all our examples accordingly. # *We should say clearly that the API is experimental.* (That is currently not the case for the PySpark docs.) # We should provide an example of how to join and select from 2 DataFrames that have identically named columns, because it is not obvious:
{code}
df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
df12 = df1.join(df2, df1['a'] == df2['a'])
df12.select(df1['a'], df2['other']).show()

a other
4 I dunno
{code}
# [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy] and [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort] should be marked as aliases if that's what they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7544) pyspark.sql.types.Row should implement __getitem__
[ https://issues.apache.org/jira/browse/SPARK-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7544: --- Parent Issue: SPARK-9576 (was: SPARK-6116) pyspark.sql.types.Row should implement __getitem__ -- Key: SPARK-7544 URL: https://issues.apache.org/jira/browse/SPARK-7544 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Nicholas Chammas Priority: Minor Following from the related discussions in [SPARK-7505] and [SPARK-7133], the {{Row}} type should implement {{\_\_getitem\_\_}} so that people can do this {code} row['field'] {code} instead of this: {code} row.field {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5517) Add input types for Java UDFs
[ https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5517: --- Parent Issue: SPARK-9576 (was: SPARK-6116) Add input types for Java UDFs - Key: SPARK-5517 URL: https://issues.apache.org/jira/browse/SPARK-5517 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7400) PortableDataStream UDT
[ https://issues.apache.org/jira/browse/SPARK-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7400: --- Parent Issue: SPARK-9576 (was: SPARK-6116) PortableDataStream UDT -- Key: SPARK-7400 URL: https://issues.apache.org/jira/browse/SPARK-7400 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Eron Wright Improve support for PortableDataStream in a DataFrame by implementing PortableDataStreamUDT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8802) Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8802: --- Target Version/s: 1.6.0 (was: 1.5.0) Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException -- Key: SPARK-8802 URL: https://issues.apache.org/jira/browse/SPARK-8802 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor There exist certain BigDecimals that can be converted into Spark SQL's Decimal class but which produce Decimals that cannot be converted back to BigDecimal without throwing NumberFormatException. For instance: {code} val x = BigDecimal(BigInt("18889465931478580854784"), -2147483648) assert(Decimal(x).toBigDecimal === x) {code} will fail with an exception: {code} java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:511) at java.math.BigDecimal.<init>(BigDecimal.java:757) at scala.math.BigDecimal$.apply(BigDecimal.scala:119) at scala.math.BigDecimal.apply(BigDecimal.scala:324) at org.apache.spark.sql.types.Decimal.toBigDecimal(Decimal.scala:142) at org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply$mcV$sp(DecimalSuite.scala:62) at org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60) at org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9577) Surface concrete iterator types in various sort classes
Reynold Xin created SPARK-9577: -- Summary: Surface concrete iterator types in various sort classes Key: SPARK-9577 URL: https://issues.apache.org/jira/browse/SPARK-9577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
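For context, a minimal sketch of the pattern being proposed; the class names below are illustrative stand-ins, not Spark's actual classes. Declaring the concrete iterator class as the return type lets the JIT bind and inline hasNext/next at the call site instead of dispatching through the abstract Iterator interface:
{code}
// Illustrative stand-in for a concrete iterator such as the one a class
// like UnsafeKVExternalSorter could expose.
final class SortedIterator(data: Array[Int]) extends Iterator[Int] {
  private var i = 0
  override def hasNext: Boolean = i < data.length
  override def next(): Int = { val v = data(i); i += 1; v }
}

class Sorter(data: Array[Int]) {
  // Before: the abstract return type hides the concrete class from callers.
  def iteratorAbstract: Iterator[Int] = new SortedIterator(data.sorted)
  // After: the concrete return type lets call sites devirtualize the calls.
  def iteratorConcrete: SortedIterator = new SortedIterator(data.sorted)
}
{code}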
[jira] [Commented] (SPARK-9577) Surface concrete iterator types in various sort classes
[ https://issues.apache.org/jira/browse/SPARK-9577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652768#comment-14652768 ] Apache Spark commented on SPARK-9577: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7911 Surface concrete iterator types in various sort classes --- Key: SPARK-9577 URL: https://issues.apache.org/jira/browse/SPARK-9577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9577) Surface concrete iterator types in various sort classes
[ https://issues.apache.org/jira/browse/SPARK-9577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9577: --- Assignee: Reynold Xin (was: Apache Spark) Surface concrete iterator types in various sort classes --- Key: SPARK-9577 URL: https://issues.apache.org/jira/browse/SPARK-9577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8874) Add missing methods in Word2Vec ML
[ https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8874: - Shepherd: Joseph K. Bradley Target Version/s: 1.5.0 Add missing methods in Word2Vec ML -- Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Assignee: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
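As a rough sketch of how the added methods might be used (the exact ML signatures are what this ticket defines, so treat the calls below as assumed shapes mirroring the existing mllib.feature.Word2VecModel API):
{code}
import org.apache.spark.ml.feature.Word2Vec

// Assumes sqlContext is available; a tiny toy corpus for illustration.
val docs = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" ")
).map(Tuple1.apply)).toDF("text")

val model = new Word2Vec()
  .setInputCol("text")
  .setVectorSize(10)
  .fit(docs)

// Proposed: expose the learned word vectors.
val vectors = model.getVectors
// Proposed: find the 3 words closest to "Spark" in the vector space.
val synonyms = model.findSynonyms("Spark", 3)
{code}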
[jira] [Resolved] (SPARK-9483) UTF8String.getPrefix only works in little-endian order
[ https://issues.apache.org/jira/browse/SPARK-9483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9483. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7902 [https://github.com/apache/spark/pull/7902] UTF8String.getPrefix only works in little-endian order -- Key: SPARK-9483 URL: https://issues.apache.org/jira/browse/SPARK-9483 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Matthew Brandyberry Priority: Critical Fix For: 1.5.0 There are two bit-masking operations and a byte reversal that should probably be handled differently on big-endian platforms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
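To make the endianness concern concrete, a minimal self-contained sketch, not Spark's actual implementation: reading the first eight bytes of a string as a long that compares like a byte-wise prefix requires a byte reversal only when the load itself was little-endian:
{code}
import java.nio.{ByteBuffer, ByteOrder}

// Sketch only: build an 8-byte comparable prefix from a string's bytes.
def comparablePrefix(bytes: Array[Byte]): Long = {
  val padded = java.util.Arrays.copyOf(bytes, 8) // zero-pad short strings
  val raw = ByteBuffer.wrap(padded).order(ByteOrder.nativeOrder()).getLong
  if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN)
    java.lang.Long.reverseBytes(raw) // restore the string's byte order
  else
    raw // a big-endian load already preserves the string's byte order
}
{code}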
[jira] [Closed] (SPARK-8891) Calling aggregation expressions on null literals fails at runtime
[ https://issues.apache.org/jira/browse/SPARK-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-8891. -- Resolution: Fixed Assignee: Yin Huai (was: Josh Rosen) Fix Version/s: 1.5.0 Fixed by Yin in new aggregates. Calling aggregation expressions on null literals fails at runtime - Key: SPARK-8891 URL: https://issues.apache.org/jira/browse/SPARK-8891 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0, 1.4.1, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 Queries that call aggregate expressions with null literals, such as {{select avg(null)}} or {{select sum(null)}} fail with various errors due to mishandling of the internal NullType type. For instance, with codegen disabled on a recent 1.5 master: {code} scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:407) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:428) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:268) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:147) at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:536) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:132) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:125) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} When codegen is enabled, the resulting code fails to compile. The fix for this issue involves changes to Cast and Sum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9526) Utilize randomized tests to reveal potential bugs in sql expressions
[ https://issues.apache.org/jira/browse/SPARK-9526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9526: --- Shepherd: Josh Rosen Assignee: Yijie Shen Utilize randomized tests to reveal potential bugs in sql expressions Key: SPARK-9526 URL: https://issues.apache.org/jira/browse/SPARK-9526 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yijie Shen Assignee: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9403) Implement code generation for In / InSet
[ https://issues.apache.org/jira/browse/SPARK-9403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9403: --- Shepherd: Davies Liu Implement code generation for In / InSet Key: SPARK-9403 URL: https://issues.apache.org/jira/browse/SPARK-9403 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The In expression doesn't have any code generation; it would be great to generate code for it. Note that we should also optimize the generated code for the all-literal case (InSet). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
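For reference, a small sketch of where these expressions come from in the public API (the setup boilerplate is illustrative only). An isin call with all-literal arguments is the case the optimizer can turn into InSet, which the generated code should handle with a set lookup rather than a linear scan:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("in-demo"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 5).toDF("key")
// isin builds an In expression; with an all-literal list the optimizer
// may rewrite it to InSet.
df.filter(col("key").isin(1, 2, 3)).show()
{code}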
[jira] [Assigned] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9581: --- Assignee: Apache Spark (was: Reynold Xin) Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9581: --- Assignee: Reynold Xin (was: Apache Spark) Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9581) Add test for JSON UDTs
[ https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652899#comment-14652899 ] Apache Spark commented on SPARK-9581: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7917 Add test for JSON UDTs -- Key: SPARK-9581 URL: https://issues.apache.org/jira/browse/SPARK-9581 Project: Spark Issue Type: Test Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7119) ScriptTransform doesn't consider the output data type
[ https://issues.apache.org/jira/browse/SPARK-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7119: Target Version/s: 1.5.0 (was: 1.6.0) ScriptTransform doesn't consider the output data type - Key: SPARK-7119 URL: https://issues.apache.org/jira/browse/SPARK-7119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Cheng Hao Priority: Critical {code:sql} from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2; {code} {noformat} 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7148) Configure Parquet block size (row group size) for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-7148: --- Comment: was deleted (was: [~josephkb] If you are busy with other issues, please don't hesitate to assign it to me.) Configure Parquet block size (row group size) for ML model import/export Key: SPARK-7148 URL: https://issues.apache.org/jira/browse/SPARK-7148 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor It would be nice if we could configure the Parquet buffer size when using Parquet format for ML model import/export. Currently, for some models (trees and ensembles), the schema has 13+ columns. With a default buffer size of 128MB (I think), that puts the allocated buffer way over the default memory made available by run-example. Because of this problem, users have to use spark-submit and explicitly use a larger amount of memory in order to run some ML examples. Is there a simple way to specify {{parquet.block.size}}? I'm not familiar with this part of SparkSQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
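On the closing question: one way to pass the setting through today, sketched under the assumption that the Parquet output format reads it from the Hadoop configuration (key name taken from the description above; sc is the active SparkContext and the 8 MB value is arbitrary):
{code}
// Shrink the Parquet row-group buffer so model export fits in a small heap.
sc.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
{code}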