Re: Apache Spark (Data Aggregation) using Java API

2014-12-25 Thread slcclimber
You could convert your CSV file to an RDD of vectors, then use the stats utilities from MLlib. Also, this should be on the user list, not the developer list.
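A minimal sketch of that approach, assuming a local SparkContext and a hypothetical data.csv of numeric columns (Statistics.colStats is the MLlib summary-statistics entry point):

  import org.apache.spark.SparkContext
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.Statistics

  val sc = new SparkContext("local", "csv-stats")

  // Parse each CSV line into a dense vector of doubles
  val vectors = sc.textFile("data.csv")
    .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

  // colStats computes per-column summary statistics in a single pass
  val summary = Statistics.colStats(vectors)
  println(summary.mean)
  println(summary.variance)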

Re: Announcing Spark 1.2!

2014-12-21 Thread slcclimber
Congratulations. This is quite exciting.

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread slcclimber
I am building and testing with sbt. Trying to run the tests, I get a lot of errors of the form: "Job aborted due to stage failure: Master removed our application: FAILED" did not contain "cancelled", and "Job aborted due to stage failure: Master removed our application: FAILED" did not contain "killed". (JobCance…

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-29 Thread slcclimber
+1
1> Compiled binaries
2> All tests pass
3> Ran the Python and Scala examples for Spark and MLlib, on local and on a master + 4 workers

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-20 Thread slcclimber
> …that you can add code in it.
>
> Thanks,
> Ashutosh
>
> From: slcclimber [via Apache Spark Developers List]
> Sent: Thurs…

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread slcclimber
+1 Built successfully and ran the Python examples.

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-19 Thread slcclimber
You could also use rdd.zipWithIndex() to create indexes.

Anant
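For instance, a minimal sketch (the sample data is made up):

  val rdd = sc.parallelize(Seq("a", "b", "c"))

  // zipWithIndex assigns a stable Long index to every element; it runs a
  // small job first to count the elements in each partition
  val indexed = rdd.zipWithIndex()   // RDD[(String, Long)]
  indexed.collect().foreach(println) // (a,0), (b,1), (c,2)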

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-16 Thread slcclimber
Ashutosh, The counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets. A better approach would be to use something along these lines:

  val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
  val rddWithIndex = rdd.zip(index)
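Spelled out as a self-contained sketch (the sample RDD is an assumption, and the final zip call is reconstructed from the truncated archive):

  import org.apache.spark.SparkContext

  val sc = new SparkContext("local", "index-zip")
  val rdd = sc.parallelize(Seq("x", "y", "z"), 3)

  // One Long index per element, spread over the same number of partitions
  val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
  val rddWithIndex = rdd.zip(index)

One caveat: zip requires both RDDs to have the same number of elements in every partition, which parallelize does not guarantee to match rdd's layout in general; the rdd.zipWithIndex() suggestion in the 2014-11-19 post above sidesteps that.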

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-11 Thread slcclimber
> We should take a vector instead, giving the user flexibility to decide the data source/type.
>
> What do you mean by vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-04 Thread slcclimber
> Outlier-Detection-with-AVF-Spark · GitHub
> Contribute to Outlier-Detection-with-AVF-Spark development by creating an account on GitHub.
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-30 Thread slcclimber
Ashutosh, A vector would be a good idea; vectors are used very frequently. Test data is usually stored in the spark/data/mllib folder.

On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" <ml-node+s1001551n9034...@n3.nabble.com> wrote:
> Hi Anant,
> sorry for my late reply. Than…

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-28 Thread slcclimber
Ashu, There is one main issue and a few stylistic/grammatical things I noticed.
1> You take an RDD of type String which you expect to be comma separated. This limits usability, since the user will have to convert their RDD to that format only for you to split it on commas. It would make more sense to take an RDD of vectors instead.
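A sketch of the suggested change in API shape (score, threshold, and both function names are hypothetical stand-ins, not the actual OutlierWithAVFModel code):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // Hypothetical scoring stub standing in for the real outlier score
  def score(v: Vector): Double = v.toArray.sum
  val threshold = 1.0

  // Current shape: forces a comma-separated String format on every caller
  def outliersFromStrings(data: RDD[String]): RDD[String] =
    data.filter(line => score(Vectors.dense(line.split(",").map(_.toDouble))) > threshold)

  // Suggested shape: accept vectors directly; callers parse however they like
  def outliersFromVectors(data: RDD[Vector]): RDD[Vector] =
    data.filter(v => score(v) > threshold)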