[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281543#comment-14281543 ] Travis Galoppo commented on SPARK-5019: --- This ticket is currently stalling SPARK-5012. Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
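A rough sketch of the shape of the change this ticket asks for (the trait shapes, field names and package below are assumptions, not the final API): the model would expose each mixture component's mean and covariance together through MultivariateGaussian instances instead of separate arrays.
{code}
import org.apache.spark.mllib.linalg.{Matrix, Vector}

// Assumed shape only; names are illustrative, not the committed API.
trait MultivariateGaussian { def mu: Vector; def sigma: Matrix }
trait GaussianMixtureModel {
  def weights: Array[Double]
  def gaussians: Array[MultivariateGaussian]   // instead of separate means / covariances
}

// A caller would then get each component's mean and covariance together:
def describe(gmm: GaussianMixtureModel): Unit =
  gmm.gaussians.zip(gmm.weights).foreach { case (g, w) =>
    println(s"weight=$w mean=${g.mu} covariance=${g.sigma}")
  }
{code}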
[jira] [Commented] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281564#comment-14281564 ] Grzegorz Dubicki commented on SPARK-5298: - Btw: I used the fork because I was misled by GitHub, which says that mesos/spark-ec2 [is] forked from shivaram/spark-ec2 on https://github.com/mesos/spark-ec2 - so I assumed that shivaram/spark-ec2 was the source, i.e. the newer, official version.. Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281535#comment-14281535 ] Grzegorz Dubicki edited comment on SPARK-5298 at 1/17/15 8:53 PM: -- Ad. 1. I am sorry, I have not noticed the warnings. I would not use unsupported instance if I would knew that. It would be nice if the script would ask me something like Not supported instance type. Continue anyway?... But switching to m3.medium didn't help. Launch output still includes the ERROR: Unknown Spark version message. See it whole here: https://gist.github.com/grzegorz-dubicki/4959eb97f9b1ca8e00ad And still there is actually no Spark on the master: {noformat} root@ip-172-31-47-137 ~]$ ls spark conf work {noformat} Trying to apply your suggestion no 2... was (Author: grzegorz-dubicki): Ad. 1. I am sorry, I have not noticed the warnings. I would not use unsupported instance if I would knew that. It would be nice if the script would ask me something like Not supported instance type. Continue anyway?... But switching to m3.medium didn't help. Launch output still includes the ERROR: Unknown Spark version message. See it whole here: https://gist.github.com/grzegorz-dubicki/4959eb97f9b1ca8e00ad And still there is actually no Spark on the master: {noformat} root@ip-172-31-47-137 ~]$ ls spark conf work {noformat} Trying to apply your suggestion no 2... Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281540#comment-14281540 ] Nicholas Chammas commented on SPARK-5298: - Ah, I found the issue. You have an outdated fork of {{mesos/spark-ec2}}. See here: https://github.com/grzegorz-dubicki/spark-ec2/blob/b388d5b22462d4b5bfc9f021f160cd438c98f2c1/spark/init.sh#L98 Please re-fork the {{v4}} branch and try again. Correct version: https://github.com/mesos/spark-ec2/blob/c8b470929838132cae6f9872eeb459d7924f1978/spark/init.sh#L105 Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-5298. - Resolution: Invalid I'm resolving this as invalid. If you believe this is incorrect, please feel free to reopen with clarification. Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5299) Is http://www.apache.org/dist/spark/KEYS out of date?
[ https://issues.apache.org/jira/browse/SPARK-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281511#comment-14281511 ] Nicholas Chammas commented on SPARK-5299: - cc [~pwendell] Is http://www.apache.org/dist/spark/KEYS out of date? - Key: SPARK-5299 URL: https://issues.apache.org/jira/browse/SPARK-5299 Project: Spark Issue Type: Question Components: Deploy Reporter: David Shaw The keys contained in http://www.apache.org/dist/spark/KEYS do not appear to match the keys used to sign the releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281514#comment-14281514 ] Nicholas Chammas commented on SPARK-5298: - A few questions for you: 1. What happens if you try to launch on {{m3.medium}} instances? {{t2.micro}} is not fully supported by {{spark-ec2}}, as the warning hints at. The error about Shark is harmless since Shark doesn't exist as of 1.2.0. This error won't show up anymore in 1.3.0. The error about Spark is strange since you passed in the version correctly. 2. What happens if you launch without explicitly setting the version? 3. What happens if you launch into the {{us-east-1}} region? Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281535#comment-14281535 ] Grzegorz Dubicki commented on SPARK-5298: - Re 1: I am sorry, I had not noticed the warnings. I would not have used an unsupported instance type if I had known that. It would be nice if the script asked something like "Unsupported instance type. Continue anyway?"... But switching to m3.medium didn't help. The launch output still includes the ERROR: Unknown Spark version message. See the whole output here: https://gist.github.com/grzegorz-dubicki/4959eb97f9b1ca8e00ad And there is still effectively no Spark on the master: {noformat} root@ip-172-31-47-137 ~]$ ls spark conf work {noformat} Trying to apply your suggestion no. 2... Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281565#comment-14281565 ] Joseph K. Bradley commented on SPARK-5019: -- [~tgaloppo] I'd recommend going ahead and submitting a PR if you have it prepared. It will be good to finalize soon since the code freeze for the next release is scheduled to be at the end of this month. Thanks for being patient! Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281557#comment-14281557 ] Grzegorz Dubicki commented on SPARK-5298: - Ad. 2. No progress. I put the output in the same gist as previously as a new commit for a free diff. Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281557#comment-14281557 ] Grzegorz Dubicki edited comment on SPARK-5298 at 1/17/15 9:15 PM: -- EDIT: Thank you, I will try to update my fork. was (Author: grzegorz-dubicki): Ad. 2. No progress. I put the output in the same gist as previously as a new commit for a free diff. Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grzegorz Dubicki closed SPARK-5298. --- Switching to mesos/spark-ec2 as a base of my fork helped. :) Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281570#comment-14281570 ] Apache Spark commented on SPARK-5019: - User 'tgaloppo' has created a pull request for this issue: https://github.com/apache/spark/pull/4088 Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281575#comment-14281575 ] Nicholas Chammas commented on SPARK-5298: - Yes, {{mesos/spark-ec2}} is the official repo. You'll see that {{spark-ec2}} from the main Spark repository points to it. Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
Reza Zadeh created SPARK-5301: - Summary: Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-5301: -- Description: 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
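A minimal sketch of the two utilities described above, written as standalone helpers against the existing distributed-matrix API (the eventual methods added to the classes may look different):
{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry}

// 1) Transpose of a CoordinateMatrix: swap the row and column index of each entry.
def transpose(mat: CoordinateMatrix): CoordinateMatrix =
  new CoordinateMatrix(
    mat.entries.map(e => MatrixEntry(e.j, e.i, e.value)),
    mat.numCols(), mat.numRows())

// 2) Convert an IndexedRowMatrix to a CoordinateMatrix by expanding each row
//    into (rowIndex, colIndex, value) entries (zero filtering omitted for brevity).
def toCoordinateMatrix(mat: IndexedRowMatrix): CoordinateMatrix = {
  val entries = mat.rows.flatMap { row =>
    row.vector.toArray.zipWithIndex.map { case (v, j) => MatrixEntry(row.index, j, v) }
  }
  new CoordinateMatrix(entries, mat.numRows(), mat.numCols())
}
{code}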
[jira] [Commented] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281577#comment-14281577 ] Apache Spark commented on SPARK-5301: - User 'rezazadeh' has created a pull request for this issue: https://github.com/apache/spark/pull/4089 Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems
Ewan Higgs created SPARK-5300: - Summary: Spark loads file partitions in inconsistent order on native filesystems Key: SPARK-5300 URL: https://issues.apache.org/jira/browse/SPARK-5300 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0, 1.1.0 Environment: Linux, EXT4, for example. Reporter: Ewan Higgs Discussed on user list in April 2014: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html And on dev list January 2015: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html When using a file system which isn't HDFS, file partitions ('part-0, part-1', etc.) are not guaranteed to load in the same order. This means previously sorted RDDs will be loaded out of order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
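A minimal illustration of controlling the input file order explicitly (a local filesystem directory of part files is assumed). It only controls the order in which the files are handed to Spark; whether ordering is preserved end-to-end is exactly what this ticket is about, so treat it as a sketch rather than a fix.
{code}
import java.io.File
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// List the part files, sort them by name, and pass them to textFile in that
// explicit order instead of relying on the filesystem's directory listing order.
def readPartsSorted(sc: SparkContext, dir: String): RDD[String] = {
  val parts = new File(dir).listFiles()
    .filter(_.getName.startsWith("part-"))
    .map(_.getPath)
    .sorted
  sc.textFile(parts.mkString(","))   // textFile accepts a comma-separated list of paths
}
{code}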
[jira] [Updated] (SPARK-4937) Adding optimization to simplify the filter condition
[ https://issues.apache.org/jira/browse/SPARK-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4937: --- Description: Adding optimization to simplify the filter condition: 1. Conditions that can be folded to a constant boolean result, such as: {code} a < 3 && a > 5 => false; a < 1 || a > 0 => true {code} 2. Simplify And/Or conditions, such as in this sql (one of the hive-testbench queries): {code} select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#32' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 7 and l_quantity <= 7 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#35' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 15 and l_quantity <= 15 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 26 and l_quantity <= 26 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ); {code} Before the optimization the plan is a CartesianProduct; in my local test this sql hangs and cannot produce a result. After the optimization the CartesianProduct is replaced by a ShuffledHashJoin, which needs only 20+ seconds to run this sql. was: Adding optimization to simplify the filter condition: 1. Conditions that can be folded to a constant boolean result, such as: a < 3 && a > 5 => false; a < 1 || a > 0 => true 2. Simplify And/Or conditions, such as in this sql (one of the hive-testbench queries): select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#32' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 7 and l_quantity <= 7 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#35' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 15 and l_quantity <= 15 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 26 and l_quantity <= 26 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ); Before the optimization the plan is a CartesianProduct; in my local test this sql hangs and cannot produce a result. After the optimization the CartesianProduct is replaced by a ShuffledHashJoin, which needs only 20+ seconds to run this sql. 
Adding optimization to simplify the filter condition Key: SPARK-4937 URL: https://issues.apache.org/jira/browse/SPARK-4937 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Fix For: 1.3.0 Adding optimization to simplify the filter condition: 1. Conditions that can be folded to a constant boolean result, such as: {code} a < 3 && a > 5 => false; a < 1 || a > 0 => true {code} 2. Simplify And/Or conditions, such as in this sql (one of the hive-testbench queries): {code} select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#32' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 7 and l_quantity <= 7 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#35' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 15 and l_quantity <= 15 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 26
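A toy illustration of the first kind of simplification described above, using a small stand-in expression type rather than the actual Catalyst rule:
{code}
// Tiny expression type plus a rewrite that folds contradictory / tautological
// range predicates on the same attribute into boolean literals.
sealed trait Expr
case class Lt(attr: String, v: Double) extends Expr   // attr < v
case class Gt(attr: String, v: Double) extends Expr   // attr > v
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr
case class Lit(b: Boolean) extends Expr

def simplify(e: Expr): Expr = e match {
  case And(l, r) => (simplify(l), simplify(r)) match {
    // attr < x && attr > y is unsatisfiable when x <= y
    case (Lt(a1, x), Gt(a2, y)) if a1 == a2 && x <= y => Lit(false)
    case (sl, sr) => And(sl, sr)
  }
  case Or(l, r) => (simplify(l), simplify(r)) match {
    // attr < x || attr > y always holds when x > y
    case (Lt(a1, x), Gt(a2, y)) if a1 == a2 && x > y => Lit(true)
    case (sl, sr) => Or(sl, sr)
  }
  case other => other
}

// simplify(And(Lt("a", 3), Gt("a", 5)))  ==> Lit(false)
// simplify(Or(Lt("a", 1), Gt("a", 0)))   ==> Lit(true)
{code}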
[jira] [Created] (SPARK-5305) Using a field in a WHERE clause that is not in the schema does not throw an exception.
Corey J. Nolet created SPARK-5305: - Summary: Using a field in a WHERE clause that is not in the schema does not throw an exception. Key: SPARK-5305 URL: https://issues.apache.org/jira/browse/SPARK-5305 Project: Spark Issue Type: Bug Components: SQL Reporter: Corey J. Nolet Given a schema: key1 = String key2 = Integer The following sql statement doesn't seem to throw an exception: SELECT * FROM myTable WHERE doesntExist = 'val1' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
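A minimal reproduction sketch, assuming the 1.2-era SQLContext API; the table name, case class, and data are illustrative:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Rec(key1: String, key2: Int)

def reproduce(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

  sc.parallelize(Seq(Rec("a", 1), Rec("b", 2))).registerTempTable("myTable")

  // Expected: an analysis error for the unknown column `doesntExist`.
  // Reported behaviour: no exception is thrown here.
  val result = sqlContext.sql("SELECT * FROM myTable WHERE doesntExist = 'val1'")
  result.collect().foreach(println)
}
{code}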
[jira] [Updated] (SPARK-5302) Add support for SQLContext partition columns
[ https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-5302: --- Description: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the the file path, similar to what is done in Hive for partitions. The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query and also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide an nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. (was: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the the file path, similar to what is done in Hive for partitions. The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, this would provide an nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly.) Add support for SQLContext partition columns -- Key: SPARK-5302 URL: https://issues.apache.org/jira/browse/SPARK-5302 Project: Spark Issue Type: New Feature Components: SQL Reporter: Bob Tiernay For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the the file path, similar to what is done in Hive for partitions. The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query and also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide an nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
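A sketch of the manual workaround this feature would automate: reading each Hive-style partition directory separately and tagging its records with the value parsed from the path (the paths, dates, and pair layout are illustrative):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Returns (dt, line) pairs so the virtual "dt" column is available to queries
// without being materialized inside the files themselves.
def readWithPartitionColumn(sc: SparkContext, base: String, dates: Seq[String]): RDD[(String, String)] =
  dates.map { dt =>
    sc.textFile(s"$base/dt=$dt").map(line => (dt, line))
  }.reduce(_ union _)

// val clicks = readWithPartitionColumn(sc, "/data/clicks", Seq("2015-01-01", "2015-01-02"))
{code}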
[jira] [Closed] (SPARK-5304) applySchema returns NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mauro Pirrone closed SPARK-5304. Resolution: Duplicate applySchema returns NullPointerException Key: SPARK-5304 URL: https://issues.apache.org/jira/browse/SPARK-5304 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Mauro Pirrone The following code snippet returns NullPointerException: val result = . val rows = result.take(10) val rowRdd = SparkManager.getContext().parallelize(rows, 1) val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, result.schema) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168) at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216) at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53) at scala.collection.immutable.List.hashCode(List.scala:84) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63) at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210) at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172) at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398) at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39) at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.get(HashMap.scala:69) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44) at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135) at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5302) Add support for SQLContext partition columns
[ https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-5302: --- Description: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query and also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide an nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. (was: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the the file path, similar to what is done in Hive for partitions. The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query and also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide an nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly.) Add support for SQLContext partition columns -- Key: SPARK-5302 URL: https://issues.apache.org/jira/browse/SPARK-5302 Project: Spark Issue Type: New Feature Components: SQL Reporter: Bob Tiernay For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query and also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide an nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5306) Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
[ https://issues.apache.org/jira/browse/SPARK-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5306: -- Component/s: SQL Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown --- Key: SPARK-5306 URL: https://issues.apache.org/jira/browse/SPARK-5306 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet This would be a pretty significant addition to the Filters that get pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
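A sketch of where such a filter would slot into a PrunedFilteredScan implementation. The not-equals case is hypothetical and shown only in comments, since this ticket is precisely about it not existing in the pushed-down filter vocabulary yet:
{code}
// The data sources filter vocabulary in org.apache.spark.sql.sources includes
// EqualTo, GreaterThan, LessThan, In, ... A not-equals filter might look like
// (names are assumptions, not the committed API):
//
//   case class NotEqualTo(attribute: String, value: Any) extends Filter
//
import org.apache.spark.sql.sources.{EqualTo, Filter}

// A data source's buildScan(requiredColumns, filters) could translate pushed
// filters into its own scan predicates, e.g. over a Map-shaped row:
def compile(f: Filter): Option[Map[String, Any] => Boolean] = f match {
  case EqualTo(attr, v) => Some(row => row.get(attr).exists(_ == v))
  // hypothetical: case NotEqualTo(attr, v) => Some(row => !row.get(attr).exists(_ == v))
  case _                => None // unsupported filters fall back to Spark-side evaluation
}
{code}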
[jira] [Created] (SPARK-5306) Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown
Corey J. Nolet created SPARK-5306: - Summary: Support for a NotEqualsFilter in the filter PrunedFilteredScan pushdown Key: SPARK-5306 URL: https://issues.apache.org/jira/browse/SPARK-5306 Project: Spark Issue Type: Improvement Reporter: Corey J. Nolet This would be a pretty significant addition to the Filters that get pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5304) applySchema returns NullPointerException
Mauro Pirrone created SPARK-5304: Summary: applySchema returns NullPointerException Key: SPARK-5304 URL: https://issues.apache.org/jira/browse/SPARK-5304 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Mauro Pirrone The following code snippet returns NullPointerException: val result = . val rows = result.take(10) val rowRdd = SparkManager.getContext().parallelize(rows, 1) val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, result.schema) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168) at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216) at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53) at scala.collection.immutable.List.hashCode(List.scala:84) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63) at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210) at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172) at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398) at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39) at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.get(HashMap.scala:69) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44) at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at 
org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135) at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5303) applySchema returns NullPointerException
Mauro Pirrone created SPARK-5303: Summary: applySchema returns NullPointerException Key: SPARK-5303 URL: https://issues.apache.org/jira/browse/SPARK-5303 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Mauro Pirrone The following code snippet returns NullPointerException: val result = . val rows = result.take(10) val rowRdd = SparkManager.getContext().parallelize(rows, 1) val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, result.schema) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168) at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216) at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53) at scala.collection.immutable.List.hashCode(List.scala:84) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63) at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210) at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172) at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58) at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210) at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398) at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39) at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.get(HashMap.scala:69) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44) at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at 
org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135) at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
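For readers trying to reproduce the above, a self-contained sketch of the reported pattern, assuming {{SparkManager}} is the reporter's own wrapper around the contexts (replaced here with a plain SparkContext/SQLContext) and using a toy JSON source in place of the elided {{val result = ...}}. Whether it hits the same NullPointerException likely depends on how {{result}} was originally built; this only shows the call sequence the stack trace points at.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ApplySchemaRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-5303-repro").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Toy stand-in for the reporter's elided "val result = ..." SchemaRDD.
    val result = sqlContext.jsonRDD(
      sc.parallelize(Seq("""{"a": 1, "b": "x"}""", """{"a": 2, "b": "y"}""")))

    // Take a small sample, re-parallelize it, and re-apply the original schema.
    val rows = result.take(10)
    val rowRdd = sc.parallelize(rows, 1)
    val schemaRdd = sqlContext.applySchema(rowRdd, result.schema)

    schemaRdd.collect().foreach(println)
    sc.stop()
  }
}
{code}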
[jira] [Created] (SPARK-5307) Add utility to help with NotSerializableException debugging
Reynold Xin created SPARK-5307: -- Summary: Add utility to help with NotSerializableException debugging Key: SPARK-5307 URL: https://issues.apache.org/jira/browse/SPARK-5307 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Scala closures can easily capture objects unintentionally, especially with implicit arguments. I think we can do more than just relying on the users being smart about using sun.io.serialization.extendedDebugInfo to find more debug information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
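To make the failure mode concrete, a small sketch (not part of any proposed utility) of how a closure can capture an enclosing object unintentionally; today the main recourse is running the JVM with {{-Dsun.io.serialization.extendedDebugInfo=true}} and reading the resulting serialization trace.
{code}
import org.apache.spark.SparkContext

// Not Serializable: holds a SparkContext plus some helper logic.
class ReportBuilder(sc: SparkContext) {
  private def scale(x: Int): Int = x * 2

  def run(): Long = {
    val data = sc.parallelize(1 to 1000)
    // 'scale' is a method on 'this', so the closure drags the whole ReportBuilder
    // (including the SparkContext) into the serialized task and task serialization
    // fails with a NotSerializableException.
    data.map(x => scale(x)).count()
  }

  def runFixed(): Long = {
    val data = sc.parallelize(1 to 1000)
    // Copy the logic into a local function first; only the function is captured.
    val scaleLocal: Int => Int = _ * 2
    data.map(scaleLocal).count()
  }
}
{code}
A debugging utility could walk the serialized object graph and report exactly which field pulled in the non-serializable reference, rather than leaving users to decode the JVM's extended debug output.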
[jira] [Created] (SPARK-5302) Add support for SQLContext partition columns
Bob Tiernay created SPARK-5302: -- Summary: Add support for SQLContext partition columns Key: SPARK-5302 URL: https://issues.apache.org/jira/browse/SPARK-5302 Project: Spark Issue Type: New Feature Components: SQL Reporter: Bob Tiernay For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions. The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3694. Resolution: Duplicate Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Ilya Ganelin Labels: starter This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
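For reference, a rough reflection-based sketch of what such an object-graph dump could look like; this is illustration only, not proposed Spark code, and the ehcache ObjectGraphWalker linked above is considerably more careful.
{code}
import java.lang.reflect.Modifier
import java.util.{Collections, IdentityHashMap}

object ObjectGraphDebug {
  /** Print one line per reachable reference, to spot unexpected captures in a task or RDD. */
  def printGraph(root: AnyRef, maxDepth: Int = 5): Unit = {
    // Identity-based visited set, so cycles and shared references are walked only once.
    val seen = Collections.newSetFromMap(new IdentityHashMap[AnyRef, java.lang.Boolean]())

    def walk(obj: AnyRef, path: String, depth: Int): Unit = {
      if (obj == null || depth > maxDepth || !seen.add(obj)) return
      println(s"$path -> ${obj.getClass.getName}")
      var cls: Class[_] = obj.getClass
      while (cls != null) {
        for (f <- cls.getDeclaredFields if !Modifier.isStatic(f.getModifiers) && !f.getType.isPrimitive) {
          f.setAccessible(true)
          walk(f.get(obj), s"$path.${f.getName}", depth + 1)
        }
        cls = cls.getSuperclass
      }
    }

    walk(root, root.getClass.getSimpleName, 0)
  }
}
{code}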
[jira] [Updated] (SPARK-5302) Add support for SQLContext partition columns
[ https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-5302: --- Description: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} where {{dt}} is a field of type {{TEXT}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. was:For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. Add support for SQLContext partition columns -- Key: SPARK-5302 URL: https://issues.apache.org/jira/browse/SPARK-5302 Project: Spark Issue Type: New Feature Components: SQL Reporter: Bob Tiernay For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} where {{dt}} is a field of type {{TEXT}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5302) Add support for SQLContext partition columns
[ https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Tiernay updated SPARK-5302: --- Description: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} where {{dt}} is a column of type {{TEXT}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. was: For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} where {{dt}} is a field of type {{TEXT}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. Add support for SQLContext partition columns -- Key: SPARK-5302 URL: https://issues.apache.org/jira/browse/SPARK-5302 Project: Spark Issue Type: New Feature Components: SQL Reporter: Bob Tiernay For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to support a virtual column that maps to part of the file path, similar to what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} where {{dt}} is a column of type {{TEXT}}). The API could allow the user to type the column using an appropriate {{DataType}} instance. This new field could be addressed in SQL statements much the same as is done in Hive. As a consequence, pruning of partitions could be possible when executing a query, and it would also remove the need to materialize a column in each logical partition that is already encoded in the path name. Furthermore, this would provide a nice interop and migration strategy for Hive users who may one day use {{SQLContext}} directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
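To make the proposal concrete, a purely hypothetical usage sketch; {{withPartitionColumn}} does not exist anywhere in Spark and is invented here only to show the intended shape of the feature.
{code}
import org.apache.spark.sql.SQLContext

// Everything marked "proposed" below is invented for illustration; only
// parquetFile, registerTempTable and sql are real Spark 1.2 APIs.
def example(sqlContext: SQLContext): Unit = {
  val clicks = sqlContext.parquetFile("/data/clicks")

  // Proposed (invented): expose the dt=... segment of each file's path as a typed virtual column.
  // val partitioned = clicks.withPartitionColumn("dt", StringType)
  // partitioned.registerTempTable("clicks")

  // With the proposal, a predicate on dt could skip whole directories
  // (everything outside /data/clicks/dt=2015-01-01/) instead of scanning every file:
  // sqlContext.sql("SELECT count(*) FROM clicks WHERE dt = '2015-01-01'")

  // What works today: the table is registered without any path-derived column.
  clicks.registerTempTable("clicks_raw")
}
{code}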
[jira] [Resolved] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir
[ https://issues.apache.org/jira/browse/SPARK-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5096. Resolution: Fixed Fix Version/s: 1.3.0 SparkBuild.scala assumes you are at the spark root dir -- Key: SPARK-5096 URL: https://issues.apache.org/jira/browse/SPARK-5096 Project: Spark Issue Type: Bug Components: Build Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.3.0 This is bad because it breaks compiling spark as an external project ref and is generally bad SBT practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
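As a generic illustration of the underlying problem (not the actual SparkBuild.scala code): paths built with {{file(...)}} resolve against whatever directory sbt was launched from, while paths derived from {{baseDirectory}} survive being referenced as an external project.
{code}
import sbt._
import Keys._

// Fragile: "python" resolves against the JVM's working directory, i.e. wherever
// sbt happened to be launched, so it breaks when this build is consumed as an
// external ProjectRef from another project.
lazy val fragileSettings: Seq[Setting[_]] = Seq(
  unmanagedResourceDirectories in Compile += file("python")
)

// Robust: resolve against this build's own base directory instead.
lazy val robustSettings: Seq[Setting[_]] = Seq(
  unmanagedResourceDirectories in Compile += baseDirectory.value / "python"
)
{code}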
[jira] [Updated] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir
[ https://issues.apache.org/jira/browse/SPARK-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5096: --- Target Version/s: (was: 1.0.3) SparkBuild.scala assumes you are at the spark root dir -- Key: SPARK-5096 URL: https://issues.apache.org/jira/browse/SPARK-5096 Project: Spark Issue Type: Bug Components: Build Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.3.0 This is bad because it breaks compiling spark as an external project ref and is generally bad SBT practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5279) Use java.math.BigDecimal as the exposed Decimal type
[ https://issues.apache.org/jira/browse/SPARK-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281645#comment-14281645 ] Apache Spark commented on SPARK-5279: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4092 Use java.math.BigDecimal as the exposed Decimal type Key: SPARK-5279 URL: https://issues.apache.org/jira/browse/SPARK-5279 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Change it from scala.BigDecimal to java.math.BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
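For context on what the change means for callers, a standalone sketch (not Spark code) of how the two types interoperate; scala.math.BigDecimal is a thin wrapper around the Java class, so converting in either direction is cheap.
{code}
object DecimalInterop {
  def main(args: Array[String]): Unit = {
    val javaBd: java.math.BigDecimal = new java.math.BigDecimal("123.45")

    // Wrapping a java.math.BigDecimal in a scala.math.BigDecimal...
    val scalaBd: scala.math.BigDecimal = scala.math.BigDecimal(javaBd)

    // ...and unwrapping it again; both refer to the same underlying value.
    val backToJava: java.math.BigDecimal = scalaBd.bigDecimal

    println(javaBd == backToJava)           // true
    println(scalaBd.underlying() eq javaBd) // true: underlying() returns the wrapped instance
  }
}
{code}
Exposing the Java type directly spares Java and JDBC-style callers a conversion, while Scala callers can still wrap the value as above when they want operator syntax.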
[jira] [Resolved] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5289. Resolution: Fixed Backport publishing of repl, yarn into branch-1.2 - Key: SPARK-5289 URL: https://issues.apache.org/jira/browse/SPARK-5289 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver), which was backported. The repl and yarn modules were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
[ https://issues.apache.org/jira/browse/SPARK-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281621#comment-14281621 ] Corey J. Nolet commented on SPARK-5296: --- The more I'm thinking about this, it would be nice if there was a tree pushed down for the filters instead of an Array. This is a significant change to the API, so it would still probably be easiest to create a new class (PrunedFilteredTreeScan?). Probably easiest to have AndFilter and OrFilter parent nodes that can be arbitrarily nested, with the leaf nodes being the filters that are already used (hopefully with the addition of the NotEqualsFilter from SPARK-5306). Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters -- Key: SPARK-5296 URL: https://issues.apache.org/jira/browse/SPARK-5296 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Currently, the BaseRelation API allows a FilteredRelation to handle an Array[Filter] which represents filter expressions that are applied as an AND operator. We should support OR operations in a BaseRelation as well. I'm not sure what this would look like in terms of API changes, but it almost seems like a FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
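A sketch of the nested filter tree being suggested; {{AndFilter}}, {{OrFilter}} and {{PrunedFilteredTreeScan}} come from the comment above and are not existing Spark API, and the leaf classes here only mirror the idea of the current sources.Filter types.
{code}
// Hypothetical filter tree for predicate pushdown: leaves correspond to the
// existing flat filters, inner nodes allow arbitrary nesting of AND / OR.
sealed trait FilterNode
case class EqualTo(attribute: String, value: Any) extends FilterNode
case class GreaterThan(attribute: String, value: Any) extends FilterNode
case class NotEqualTo(attribute: String, value: Any) extends FilterNode // cf. SPARK-5306
case class AndFilter(children: Seq[FilterNode]) extends FilterNode
case class OrFilter(children: Seq[FilterNode]) extends FilterNode

// (age > 21 AND country = "US") OR name != "unknown"
val pushedDown: FilterNode = OrFilter(Seq(
  AndFilter(Seq(GreaterThan("age", 21), EqualTo("country", "US"))),
  NotEqualTo("name", "unknown")
))

// A relation implementing the hypothetical PrunedFilteredTreeScan would receive
// the whole tree and decide how much of it it can evaluate natively.
{code}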
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4920: -- Target Version/s: 1.0.3 (was: 1.0.3, 1.2.1) current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Labels: backport-needed Fix For: 1.1.1, 1.2.1 It is not convenient to see the Spark version. We can keep the same style as the Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4920: -- Target Version/s: 1.0.3, 1.2.1 (was: 1.1.1, 1.0.3, 1.2.1) Fix Version/s: 1.1.1 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Labels: backport-needed Fix For: 1.1.1, 1.2.1 It is not convenient to see the Spark version. We can keep the same style as the Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Issue Type: Bug (was: Improvement) Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Bug Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png In fine-grained mode, the SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to trace a specific job, because log lines from different jobs end up in the same log file. The same value is also used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! !Screen Shot 2015-01-12 at 11.34.30 AM.png! !Screen Shot 2015-01-12 at 11.34.41 AM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5221) FileInputDStream remember window in certain situations causes files to be ignored
[ https://issues.apache.org/jira/browse/SPARK-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jem Tucker updated SPARK-5221: -- Priority: Major (was: Minor) FileInputDStream remember window in certain situations causes files to be ignored Key: SPARK-5221 URL: https://issues.apache.org/jira/browse/SPARK-5221 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.2.0 Reporter: Jem Tucker When batch intervals are greater than 1 minute, a file that begins to be moved into a directory just before FileInputDStream.findNewFiles() is called, but does not become visible until after findNewFiles() has executed, is not included in that batch. The file is then ignored in the following batch as well, because its modification time is less than the modTimeIgnoreThreshold. This causes Spark Streaming to ignore data that should not be ignored, especially when large files are being moved into the directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
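A simplified model of the timing described above; this is not the actual FileInputDStream code, just the shape of the selection check that causes the miss, with invented timestamps.
{code}
// Simplified model: a file is picked up only if it is visible when findNewFiles()
// runs AND its modification time is at least the ignore threshold derived from
// the remember window.
case class FileInfo(path: String, modTime: Long, visibleAt: Long)

def selectedInBatch(f: FileInfo, findNewFilesTime: Long, modTimeIgnoreThreshold: Long): Boolean =
  f.visibleAt <= findNewFilesTime && f.modTime >= modTimeIgnoreThreshold

// Batch interval of 2 minutes: the move starts just before the scan at t=120s,
// the file only becomes visible at t=121s, and its modTime is stamped at t=119s.
val file = FileInfo("/in/data.gz", modTime = 119000L, visibleAt = 121000L)

// Missed in this batch: not yet visible at scan time.
val missedNow = !selectedInBatch(file, findNewFilesTime = 120000L, modTimeIgnoreThreshold = 0L)
// Missed again in the next batch: the threshold has moved past the file's modTime.
val missedNext = !selectedInBatch(file, findNewFilesTime = 240000L, modTimeIgnoreThreshold = 120000L)
{code}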
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281361#comment-14281361 ] François Garillot commented on SPARK-1812: -- Hem. Both issues are now closed. Pinging [~pwendell]. Support cross-building with Scala 2.11 -- Key: SPARK-1812 URL: https://issues.apache.org/jira/browse/SPARK-1812 Project: Spark Issue Type: New Feature Components: Build, Spark Core Reporter: Matei Zaharia Assignee: Prashant Sharma Fix For: 1.2.0 Since Scala 2.10/2.11 are source compatible, we should be able to cross-build for both versions. From what I understand there are basically two things we need to figure out: 1. Have two versions of our dependency graph, one that uses 2.11 dependencies and the other that uses 2.10 dependencies. 2. Figure out how to publish different poms for 2.10 and 2.11. I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't really well supported by Maven since published pom's aren't generated dynamically. But we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
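For readers unfamiliar with cross-building, a minimal sbt-style sketch of the general idea; the Spark build in question is Maven-based and the actual work went through a scala-2.11 profile, so this is illustration only, with invented dependency choices.
{code}
// build.sbt sketch: build and publish against multiple Scala versions.
scalaVersion := "2.10.4"
crossScalaVersions := Seq("2.10.4", "2.11.2")

// Version-specific dependencies selected per Scala binary version.
libraryDependencies ++= (CrossVersion.partialVersion(scalaVersion.value) match {
  case Some((2, 11)) => Seq("org.scala-lang.modules" %% "scala-xml" % "1.0.2")
  case _             => Seq.empty
})

// Running `sbt +compile` or `sbt +publish` then repeats the task for every
// version listed in crossScalaVersions.
{code}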
[jira] [Commented] (SPARK-4937) Adding optimization to simplify the filter condition
[ https://issues.apache.org/jira/browse/SPARK-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281384#comment-14281384 ] Apache Spark commented on SPARK-4937: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4086 Adding optimization to simplify the filter condition Key: SPARK-4937 URL: https://issues.apache.org/jira/browse/SPARK-4937 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Fix For: 1.3.0 Adding optimization to simplify the filter condition: 1 condition that can get the boolean result such as: a 3 a 5 False a 1 || a 0 True 2 Simplify And, Or condition, such as the sql (one of hive-testbench ): select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#32' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity = 7 and l_quantity = 7 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#35' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity = 15 and l_quantity = 15 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity = 26 and l_quantity = 26 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ); Before optimized it is a CartesianProduct, in my locally test this sql hang and can not get result, after optimization the CartesianProduct replaced by ShuffledHashJoin, which only need 20+ seconds to run this sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
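As a standalone illustration of the first kind of simplification (constant-folding a contradictory predicate), independent of the actual Catalyst rule in the pull request; the tiny expression language below is invented purely for the example.
{code}
// Minimal expression language, just to show constant-folding of filter predicates.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Gt(left: Expr, right: Expr) extends Expr   // left > right
case class Lt(left: Expr, right: Expr) extends Expr   // left < right
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class BoolLit(value: Boolean) extends Expr

def simplify(e: Expr): Expr = e match {
  // a > x && a < y is unsatisfiable when x >= y, so fold it to false.
  case And(Gt(Attr(a), Lit(x)), Lt(Attr(b), Lit(y))) if a == b && x >= y => BoolLit(false)
  case And(l, r) => (simplify(l), simplify(r)) match {
    case (BoolLit(false), _) | (_, BoolLit(false)) => BoolLit(false)
    case (BoolLit(true), rr) => rr
    case (ll, BoolLit(true)) => ll
    case (ll, rr) => And(ll, rr)
  }
  case Or(l, r) => (simplify(l), simplify(r)) match {
    case (BoolLit(true), _) | (_, BoolLit(true)) => BoolLit(true)
    case (BoolLit(false), rr) => rr
    case (ll, BoolLit(false)) => ll
    case (ll, rr) => Or(ll, rr)
  }
  case other => other
}

// (a > 5 && a < 3) || (a > 1)  simplifies to  a > 1
val simplified = simplify(Or(And(Gt(Attr("a"), Lit(5)), Lt(Attr("a"), Lit(3))), Gt(Attr("a"), Lit(1))))
{code}
In the query above, once the unsatisfiable conjuncts fold away, the shared join condition can be lifted out of the OR, which is what replaces the CartesianProduct with a ShuffledHashJoin.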
[jira] [Created] (SPARK-5297) File Streams do not work with custom key/values
Leonidas Fegaras created SPARK-5297: --- Summary: File Streams do not work with custom key/values Key: SPARK-5297 URL: https://issues.apache.org/jira/browse/SPARK-5297 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Leonidas Fegaras Priority: Minor Fix For: 1.2.0 The following code: {code} stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory) .foreachRDD(new Function<JavaPairRDD<K,V>,Void>() { public Void call ( JavaPairRDD<K,V> rdd ) throws Exception { for ( Tuple2<K,V> x: rdd.collect() ) System.out.println("# "+x._1+" "+x._2); return null; } }); stream_context.start(); stream_context.awaitTermination(); {code} for custom (serializable) classes K and V compiles fine but gives an error when I drop a new Hadoop sequence file in the directory: {quote} 15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 1421507639000 ms java.lang.ClassCastException: java.lang.Object cannot be cast to org.apache.hadoop.mapreduce.InputFormat at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:234) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:128) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:296) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288) at scala.Option.orElse(Option.scala:257) {quote} The same classes K and V work fine for non-streaming Spark: {code} spark_context.newAPIHadoopFile(path,F.class,K.class,SequenceFileInputFormat.class,conf) {code} Also, streaming works fine for TextFileInputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281441#comment-14281441 ] Apache Spark commented on SPARK-4894: - User 'leahmcguire' has created a pull request for this issue: https://github.com/apache/spark/pull/4087 Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: RJ Nowling Assignee: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281444#comment-14281444 ] Leah McGuire commented on SPARK-4894: - Hi [~rnowling], I submitted a pull request to add just the Bernoulli NB to the existing code. It follows the outline you suggested above, with the exception that I used an enumeration for the model type rather than a simple string. If you would have time to review it I would appreciate the feedback! Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: RJ Nowling Assignee: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
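For readers unfamiliar with the distinction, a self-contained sketch of the Bernoulli scoring rule in plain Scala (not the MLlib API from the pull request): absent features contribute a log(1 - p) term, which the multinomial variant has no equivalent of.
{code}
import scala.math.log

// Score one class for a binary feature vector under Bernoulli naive Bayes:
//   log P(c) + sum_i [ x_i * log(p_ci) + (1 - x_i) * log(1 - p_ci) ]
// where p_ci is the estimated probability that feature i is present in class c.
def bernoulliLogScore(logPrior: Double, featureProb: Array[Double], x: Array[Double]): Double = {
  require(featureProb.length == x.length)
  var score = logPrior
  var i = 0
  while (i < x.length) {
    score += (if (x(i) > 0.0) log(featureProb(i)) else log(1.0 - featureProb(i)))
    i += 1
  }
  score
}

// The multinomial variant, by contrast, only accumulates x_i * log(p_ci),
// so features that are absent (x_i = 0) carry no information.
{code}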
[jira] [Created] (SPARK-5298) Spark not starting on EC2 using spark-ec2
Grzegorz Dubicki created SPARK-5298: --- Summary: Spark not starting on EC2 using spark-ec2 Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5298) Spark not starting on EC2 using spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grzegorz Dubicki updated SPARK-5298: Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Affects Version/s: 1.2.0 Spark not starting on EC2 using spark-ec2 - Key: SPARK-5298 URL: https://issues.apache.org/jira/browse/SPARK-5298 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: I use Spark 1.2.0 + this PR https://github.com/mesos/spark-ec2/pull/76 from my fork https://github.com/grzegorz-dubicki/spark and v4 Spark EC2 script with the same fix from https://github.com/grzegorz-dubicki/spark-ec2 Reporter: Grzegorz Dubicki Spark doesn't start after creating it with: {noformat} ./spark-ec2 -k * -i * -s 1 --region=eu-west-1 --instance-type=t2.micro --spark-version=1.2.0 launch test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/f15caf9ff6c96ec69fee) ..or after stopping the instances on EC2 via AWS Console and starting the cluster with: {noformat} ./spark-ec2 -k * -i * --region=eu-west-1 start test2 {noformat} (Output: https://gist.github.com/grzegorz-dubicki/8b87192b3aa4e0ed028c) Please note these errors in launch output: {noformat} ~/spark-ec2 Initializing spark ~ ~/spark-ec2 ERROR: Unknown Spark version Initializing shark ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version {noformat} ..and then these in start output: {noformat} ./spark-standalone/setup.sh: line 26: /root/spark/sbin/stop-all.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 31: /root/spark/sbin/start-master.sh: Nie ma takiego pliku ani katalogu ./spark-standalone/setup.sh: line 37: /root/spark/sbin/start-slaves.sh: Nie ma takiego pliku ani katalogu {noformat} (the error message is No such file or directory, in Polish) It seems to be related with http://mail-archives.us.apache.org/mod_mbox/spark-user/201412.mbox/%3cCAJ5A9B_U=mdcxyftdkbk+sljzbcdpcb0qqs83u0grozfgkc...@mail.gmail.com%3e - I also have almost empty Spark and Shark dirs on the master of test2 cluster: {noformat} root@ip-172-31-7-179 ~]$ ls spark conf work root@ip-172-31-7-179 ~]$ ls shark/ conf {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5299) Is http://www.apache.org/dist/spark/KEYS out of date?
David Shaw created SPARK-5299: - Summary: Is http://www.apache.org/dist/spark/KEYS out of date? Key: SPARK-5299 URL: https://issues.apache.org/jira/browse/SPARK-5299 Project: Spark Issue Type: Question Components: Deploy Reporter: David Shaw The keys contained in http://www.apache.org/dist/spark/KEYS do not appear to match the keys used to sign the releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org