[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596794#comment-14596794 ] Anant Daksh Asthana commented on SPARK-4038: I do agree a general wrapper might be quite involved. It may be wise to create a toolkit of algorithms and just document them well, following common patterns to make them all compatible with mlpipe. What do you think of that? On Mon, Jun 22, 2015, 4:14 PM Joseph K. Bradley (JIRA) j...@apache.org Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor The aim of this JIRA is to discuss which parallel outlier detection algorithms can be included in MLlib. The one I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, and relies on a single data scan. It is not distance based and is well suited for categorical data. In the original paper a parallel version is also given, which is not complicated to implement. I am working on the implementation and will soon submit the initial code for review. Here is the link to the paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 As pointed out by Xiangrui in the discussion http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html there are other algorithms as well. Let's discuss which will be more general and easily parallelized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
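For concreteness, the AVF scoring described above (score each point by the frequency of its attribute values; lowest scores are outliers) can be sketched in a few lines of plain Python. This is a single-machine illustration under my own naming, not the proposed Spark implementation:

```python
from collections import Counter

def avf_outliers(points, k):
    """Attribute Value Frequency (AVF): a point whose attribute values are
    rare across the data set gets a low score; the k lowest-scoring points
    are flagged as outliers."""
    if not points:
        return []
    m = len(points[0])
    # single scan: per-attribute frequency tables
    freq = [Counter(p[i] for p in points) for i in range(m)]
    # second pass: score = mean frequency of the point's attribute values
    scored = sorted(points, key=lambda p: sum(freq[i][p[i]] for i in range(m)) / m)
    return scored[:k]
```

The two passes map naturally onto distributed primitives (a frequency-count aggregation followed by a scoring pass), which is presumably why the paper's parallel version is straightforward.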
[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596830#comment-14596830 ] Anant Daksh Asthana commented on SPARK-4038: So what are some good first algorithms in your opinion? AVF, k-means, k-nearest-neighbor-based algorithms, or maybe LOF? I think AVF and k-means might be a good starting point. On Mon, Jun 22, 2015, 5:09 PM Joseph K. Bradley (JIRA) j...@apache.org Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor
[jira] [Commented] (SPARK-4649) Add method unionAll to PySpark's SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228640#comment-14228640 ] Anant Daksh Asthana commented on SPARK-4649: I would like to take on this task. Add method unionAll to PySpark's SchemaRDD --- Key: SPARK-4649 URL: https://issues.apache.org/jira/browse/SPARK-4649 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.1.0 Reporter: Luca Foschini Priority: Minor PySpark has no equivalent of Scala's SchemaRDD.unionAll. The standard SchemaRDD.union method downcasts the result to UnionRDD, which makes it not amenable to chaining.
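The chaining problem above can be illustrated with a toy Python sketch; the class and method here are hypothetical stand-ins for illustration only, not the actual PySpark API. The point is that unionAll must return the same wrapper type, so calls compose:

```python
class SchemaRDD:
    """Toy stand-in for PySpark's SchemaRDD (hypothetical, for illustration)."""
    def __init__(self, rows):
        self.rows = list(rows)

    def unionAll(self, other):
        # Returning the same wrapper type (rather than a downcast result,
        # as union does with UnionRDD) keeps the result chainable.
        return SchemaRDD(self.rows + other.rows)

a, b, c = SchemaRDD([1]), SchemaRDD([2]), SchemaRDD([3])
combined = a.unionAll(b).unionAll(c)  # chaining works because the type is preserved
```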
[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220232#comment-14220232 ] Anant Daksh Asthana commented on SPARK-4038: So AVF is based on k-modes for detecting outliers, which is similar in spirit to k-means. We could add the k-modes algorithm and have AVF outlier detection as an add-on or extension to it. We could do a similar thing for detecting outliers with k-means etc. too. Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor
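To make the k-modes connection concrete, here is a minimal plain-Python sketch of the two pieces k-modes substitutes into the k-means loop for categorical data (function names are my own, and this is not proposed Spark code):

```python
from collections import Counter

def mode_center(cluster_rows):
    """k-modes analogue of a k-means centroid: the per-attribute mode
    of the rows currently assigned to a cluster."""
    m = len(cluster_rows[0])
    return tuple(Counter(r[i] for r in cluster_rows).most_common(1)[0][0]
                 for i in range(m))

def mismatch_distance(row, center):
    """Simple matching dissimilarity used by k-modes: the number of
    attributes where the row disagrees with the center."""
    return sum(a != b for a, b in zip(row, center))
```

With these two pieces in place, the assignment/update loop is identical in shape to k-means, and outlier scores can be derived from each point's distance to its cluster mode.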
[jira] [Commented] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192996#comment-14192996 ] Anant Daksh Asthana commented on SPARK-4127: [~mengxr][~freeman-lab] I am running into some issues and wondering if you could help. I have pushed some changes to my branch https://github.com/anantasty/spark/tree/SPARK-4127 I added functions to PythonMLLibAPI.scala and to python/pyspark/mllib/regression.py, and added an example similar to the Scala one. When I run it I get an error: java.lang.ClassCastException: [B cannot be cast to org.apache.spark.mllib.linalg.Vector which I am not sure how to work with. There are plenty of examples where Python SparseVectors and DenseVectors are passed over in RDDs and work just fine. Also the training data is sent as a pair of (Double, Vector) and works fine, but on the test data (model.predictOn) it throws the exception. Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
[jira] [Updated] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4127: --- Summary: Streaming Linear Regression- Python bindings (was: Streaming Linear Regression) Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor
[jira] [Updated] (SPARK-4108) Fix uses of @deprecated in catalyst dataTypes
[ https://issues.apache.org/jira/browse/SPARK-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4108: --- Component/s: SQL Fix uses of @deprecated in catalyst dataTypes - Key: SPARK-4108 URL: https://issues.apache.org/jira/browse/SPARK-4108 Project: Spark Issue Type: Task Components: SQL Reporter: Anant Daksh Asthana Priority: Trivial @deprecated takes two parameters, message and version; sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala has a usage of @deprecated with just one parameter.
[jira] [Created] (SPARK-4118) Create python bindings for Streaming KMeans
Anant Daksh Asthana created SPARK-4118: -- Summary: Create python bindings for Streaming KMeans Key: SPARK-4118 URL: https://issues.apache.org/jira/browse/SPARK-4118 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor Create Python bindings for Streaming K-means. This is in reference to https://issues.apache.org/jira/browse/SPARK-3254 which adds Streaming K-means functionality to MLlib.
[jira] [Created] (SPARK-4127) Streaming Linear Regression
Anant Daksh Asthana created SPARK-4127: -- Summary: Streaming Linear Regression Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor Create python bindings for Streaming Linear Regression (MLlib).
[jira] [Commented] (SPARK-4127) Streaming Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187906#comment-14187906 ] Anant Daksh Asthana commented on SPARK-4127: [~mengxr] [~freeman-lab] Just added this issue. Could you please assign it to me? Thanks. Streaming Linear Regression --- Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor
[jira] [Updated] (SPARK-4127) Streaming Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4127: --- Description: Create python bindings for Streaming Linear Regression (MLlib). The Mllib file relevant to this issue can be found (here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala] was: Create python bindings for Streaming Linear Regression (MLlib). Streaming Linear Regression --- Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor
[jira] [Updated] (SPARK-4127) Streaming Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4127: --- Description: Create python bindings for Streaming Linear Regression (MLlib). The Mllib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala was: Create python bindings for Streaming Linear Regression (MLlib). The Mllib file relevant to this issue can be found (here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala] Streaming Linear Regression --- Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor
[jira] [Created] (SPARK-4108) Fix uses od @deprecated in catalyst dataTypes
Anant Daksh Asthana created SPARK-4108: -- Summary: Fix uses od @deprecated in catalyst dataTypes Key: SPARK-4108 URL: https://issues.apache.org/jira/browse/SPARK-4108 Project: Spark Issue Type: Task Reporter: Anant Daksh Asthana Priority: Trivial
[jira] [Updated] (SPARK-4108) Fix uses of @deprecated in catalyst dataTypes
[ https://issues.apache.org/jira/browse/SPARK-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4108: --- Summary: Fix uses of @deprecated in catalyst dataTypes (was: Fix uses od @deprecated in catalyst dataTypes) Fix uses of @deprecated in catalyst dataTypes - Key: SPARK-4108 URL: https://issues.apache.org/jira/browse/SPARK-4108 Project: Spark Issue Type: Task Reporter: Anant Daksh Asthana Priority: Trivial
[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186347#comment-14186347 ] Anant Daksh Asthana commented on SPARK-2335: [~Rusty][~bgawalt] I would be willing to help with this implementation as well. Thanks. k-Nearest Neighbor classification and regression for MLLib -- Key: SPARK-2335 URL: https://issues.apache.org/jira/browse/SPARK-2335 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Brian Gawalt Priority: Minor Labels: features, newbie The k-Nearest Neighbor model for classification and regression problems is a simple and intuitive approach, offering a straightforward path to creating non-linear decision/estimation contours. Its downsides -- high variance (sensitivity to the known training data set) and computational intensity for estimating new point labels -- both play to Spark's big data strengths: lots of data mitigates data concerns; lots of workers mitigate computational latency. We should include kNN models as options in MLLib.
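The prediction step the issue describes can be sketched in a few lines of plain Python (a single-machine illustration with names of my own choosing, not a distributed MLlib implementation):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Predict the majority label among the k training points nearest
    to `query` by Euclidean distance."""
    # the "computational intensity" noted above lives here: every
    # training point is scanned for every query
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

The full scan per query is exactly what distribution would amortize: each worker scans its partition and the per-partition top-k candidates are merged.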
[jira] [Commented] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184750#comment-14184750 ] Anant Daksh Asthana commented on SPARK-3838: Pull request for resolution can be found at https://github.com/apache/spark/pull/2952 Python code example for Word2Vec in user guide -- Key: SPARK-3838 URL: https://issues.apache.org/jira/browse/SPARK-3838 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Anant Daksh Asthana Priority: Trivial
[jira] [Commented] (SPARK-2396) Spark EC2 scripts fail when trying to log in to EC2 instances
[ https://issues.apache.org/jira/browse/SPARK-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184803#comment-14184803 ] Anant Daksh Asthana commented on SPARK-2396: Seems like a Python issue on your system: subprocess.check_call is failing because it cannot find the command it is trying to execute. Spark EC2 scripts fail when trying to log in to EC2 instances - Key: SPARK-2396 URL: https://issues.apache.org/jira/browse/SPARK-2396 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Environment: Windows 8, Cygwin and command prompt, Python 2.7 Reporter: Stephen M. Hopper Labels: aws, ec2, ssh I cannot seem to successfully start up a Spark EC2 cluster using the spark-ec2 script. I'm using variations on the following command: ./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 --spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch spark-test-cluster The script always allocates the EC2 instances without much trouble, but can never seem to complete the SSH step to install Spark on the cluster. It always complains about my SSH key. If I try to log in with my SSH key doing something like this: ssh -i my-key-name.pem root@<insert ip of my instance here> it fails. However, if I log in to the AWS console, click on my instance and select connect, it displays the instructions for SSHing into my instance (which are no different from the ssh command from above). So, if I rerun the SSH command from above, I'm able to log in. Next, if I try to rerun the spark-ec2 command from above (replacing launch with start), the script logs in and starts installing Spark. However, it eventually errors out with the following output: Cloning into 'spark-ec2'... remote: Counting objects: 1465, done. remote: Compressing objects: 100% (697/697), done. remote: Total 1465 (delta 485), reused 1465 (delta 485) Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done. Resolving deltas: 100% (485/485), done.
Connection to ec2-my-clusters-ip.us-west-1.compute.amazonaws.com closed. Searching for existing cluster spark-test-cluster... Found 1 master(s), 1 slaves Starting slaves... Starting master... Waiting for instances to start up... Waiting 120 more seconds... Deploying files to master... Traceback (most recent call last): File "./spark_ec2.py", line 823, in <module> main() File "./spark_ec2.py", line 815, in main real_main() File "./spark_ec2.py", line 806, in real_main setup_cluster(conn, master_nodes, slave_nodes, opts, False) File "./spark_ec2.py", line 450, in setup_cluster deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, modules) File "./spark_ec2.py", line 593, in deploy_files subprocess.check_call(command) File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in check_call retcode = call(*popenargs, **kwargs) File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call return Popen(*popenargs, **kwargs).wait() File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__ errread, errwrite) File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified So, in short, am I missing something or is this a bug? Any help would be appreciated. Other notes: -I've tried both us-west-1 and us-east-1 regions. -I've tried several different instance types. -I've tried playing with the permissions on the ssh key (600, 400, etc.), but to no avail
[jira] [Comment Edited] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168989#comment-14168989 ] Anant Daksh Asthana edited comment on SPARK-3838 at 10/13/14 6:22 AM: -- Thanks [~mengxr] I will follow the instructions. I did also mention the coding guides are centered around Java/ Scala. was (Author: slcclimber): Thanks [~mengxr] I will follow the instructions. I did also mention the coding guides are centered around Java/ Scala. It would be nice to create one for Pyspark which colsely follows PEP-8.
[jira] [Commented] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168492#comment-14168492 ] Anant Daksh Asthana commented on SPARK-3838: I would like to contribute this example if no one has objections. Python code example for Word2Vec in user guide -- Key: SPARK-3838 URL: https://issues.apache.org/jira/browse/SPARK-3838 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Liquan Pei Priority: Trivial
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156082#comment-14156082 ] Anant Daksh Asthana commented on SPARK-3730: Thanks Patrick. On Wed, Oct 1, 2014 at 6:09 PM, Patrick Wendell (JIRA) j...@apache.org Any one else having building spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building using mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package Here is the error I get: http://pastebin.com/Shi43r53
[jira] [Created] (SPARK-3725) Link to building spark returns a 404
Anant Daksh Asthana created SPARK-3725: -- Summary: Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor The README.md link to Building Spark returns a 404
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152047#comment-14152047 ] Anant Daksh Asthana commented on SPARK-3725: Would it make sense to add a building-Spark document in the repo? This would make it easier to find documentation, and anyone who has the source will have the docs for it as well. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152269#comment-14152269 ] Anant Daksh Asthana commented on SPARK-3730: Definitely not a Spark issue. Just thought someone on here might know a solution. Any one else having building spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor
[jira] [Commented] (SPARK-951) Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141124#comment-14141124 ] Anant Daksh Asthana commented on SPARK-951: --- caizhua, could you please elaborate a little more on the issue? Right now 'This code' and 'input file named Gmm_spark.tbl' are unknown to me at the time of reading this. Gaussian Mixture Model -- Key: SPARK-951 URL: https://issues.apache.org/jira/browse/SPARK-951 Project: Spark Issue Type: Story Components: Examples Affects Versions: 0.7.3 Reporter: caizhua Priority: Critical Labels: Learning, Machine, Model This code includes the code for Gaussian Mixture Model. The input file named Gmm_spark.tbl is the input for this program.
[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136696#comment-14136696 ] Anant Daksh Asthana commented on SPARK-1486: That sounds very true and relevant. I am completely with you on this one. On Tue, Sep 16, 2014 at 5:50 PM, Xiangrui Meng (JIRA) j...@apache.org Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. 
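Approach 0 from the list above (one copy of the data, models trained one after another, best chosen on a validation set) can be sketched in plain Python. `fit_mean` is a hypothetical stand-in for a real training routine, used only so the selection loop is runnable:

```python
def fit_mean(data, reg):
    # stand-in "training": a shrunk mean, where `reg` plays the role of
    # a regularization parameter (hypothetical model, for illustration)
    return sum(data) / (len(data) + reg)

def select_best(train_data, valid_data, params):
    """Approach 0: train one model per parameter setting over the same
    data, keep the one with the lowest validation error."""
    best_err, best_param, best_model = None, None, None
    for reg in params:
        model = fit_mean(train_data, reg)
        err = sum((x - model) ** 2 for x in valid_data)
        if best_err is None or err < best_err:
            best_err, best_param, best_model = err, reg, model
    return best_param, best_model
```

The other approaches change where this loop runs (inside one job, across data copies, or per worker), but the select-by-validation-error structure stays the same.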
[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs
[ https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047045#comment-14047045 ] Anant Daksh Asthana commented on SPARK-1945: Just looked at the code and I agree. Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Matei Zaharia Labels: Starter Fix For: 1.0.0 Right now some of the Java tabs only say the following: All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object. Would be nice to translate the Scala code into Java instead. Also, a few pages (most notably the Matrix one) don't have Java examples at all.
[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs
[ https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040796#comment-14040796 ] Anant Daksh Asthana commented on SPARK-1945: Michael, this issue refers to the examples provided for using MLlib in Scala and Java. There are a lot more examples for Scala (https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples) than for Java (https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). I have started tackling a few of them, and it would be great if we could team up and work on creating examples in Java as well. Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Matei Zaharia Labels: Starter Fix For: 1.0.0
[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039355#comment-14039355 ] Anant Daksh Asthana commented on SPARK-2198: I am in agreement with Helena. Partition the scala build file so that it is easier to maintain --- Key: SPARK-2198 URL: https://issues.apache.org/jira/browse/SPARK-2198 Project: Spark Issue Type: Task Components: Build Reporter: Helena Edelson Priority: Minor Original Estimate: 3h Remaining Estimate: 3h Partition the build into the standard Dependencies.scala, Version.scala, Settings.scala, and Publish.scala files, keeping SparkBuild clean so it only describes the modules and their dependencies. Changes in versions, for example, would then need to be made only in Version.scala, settings changes such as scalac flags in Settings.scala, and so on. I'd be happy to do this ([~helena_e])
[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs
[ https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032121#comment-14032121 ] Anant Daksh Asthana commented on SPARK-1945: I have started writing some of these examples in Java. I will make pull requests on GitHub as I test them.
[jira] [Commented] (SPARK-2061) Deprecate `splits` in JavaRDDLike and add `partitions`
[ https://issues.apache.org/jira/browse/SPARK-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028874#comment-14028874 ] Anant Daksh Asthana commented on SPARK-2061: Proposed fix can be found at: https://github.com/apache/spark/pull/1062 Deprecate `splits` in JavaRDDLike and add `partitions` -- Key: SPARK-2061 URL: https://issues.apache.org/jira/browse/SPARK-2061 Project: Spark Issue Type: Bug Components: Java API Reporter: Patrick Wendell Assignee: Anant Daksh Asthana Priority: Minor Labels: starter Most of Spark has moved over to consistently using `partitions` instead of `splits`. We should do likewise: add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other APIs (e.g. Python) call `splits` and change those to use the newer API.
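The renaming described in the issue follows a standard deprecation-alias pattern: `partitions` becomes the canonical accessor, and `splits` is kept as a deprecated method that simply delegates to it, so existing callers keep compiling while new code migrates. A minimal sketch, using the hypothetical names `RddLike` and `SplitsDemo` rather than the actual Spark `JavaRDDLike` source:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical mini-interface illustrating the SPARK-2061 proposal:
// `partitions` is the new canonical method; `splits` stays behind as a
// deprecated alias that just forwards to it.
interface RddLike {
    List<String> partitions();

    @Deprecated
    default List<String> splits() {
        return partitions();
    }
}

public class SplitsDemo {
    public static void main(String[] args) {
        // A lambda suffices because partitions() is the only abstract method.
        RddLike rdd = () -> Arrays.asList("part-0", "part-1");
        // Both accessors return the same partition list; only the name differs.
        System.out.println(rdd.partitions().equals(rdd.splits())); // prints "true"
    }
}
```

Because the alias delegates rather than duplicating state, other language APIs (e.g. Python) that still call `splits` keep working unchanged until they are migrated.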
[jira] [Commented] (SPARK-2061) Deprecate `splits` in JavaRDDLike and add `partitions`
[ https://issues.apache.org/jira/browse/SPARK-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027978#comment-14027978 ] Anant Daksh Asthana commented on SPARK-2061: Could I be added as the assignee on this task? I am currently working on a fix.