You could convert your CSV file to an RDD of vectors.
Then use the stats utilities from MLlib.
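For instance, a minimal sketch, assuming a purely numeric, comma-separated
file at a hypothetical path data.csv and the shell's SparkContext sc:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = sc.textFile("data.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val summary = Statistics.colStats(vectors)
println(summary.mean)      // per-column means
println(summary.variance)  // per-column variances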
Also, this question belongs on the user list, not the developer list.
---
Congratulations. This is quite exciting.
---
I am building and testing using sbt. Trying to run the tests, I get a lot
of errors of the form:
"Job aborted due to stage failure: Master removed our application: FAILED"
did not contain "cancelled", and "Job aborted due to stage failure: Master
removed our application: FAILED" did not contain "killed"
(JobCancellationSuite)
+1
1> Compiled binaries
2> All tests pass
3> Ran Python and Scala examples for Spark and MLlib locally and on a
master + 4 workers
---
> > ...that you can add code in it.
> >
> > Thanks,
> >
> > Ashutosh
> >
> > From: slcclimber [via Apache Spark Developers List]
> > Sent: Thurs
+1
Built successfully and ran the Python examples.
---
You could also use rdd.zipWithIndex() to create indexes.
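For example, a small sketch (zipWithIndex returns an RDD of
(element, index) pairs):

val rdd = sc.parallelize(Seq("a", "b", "c"))
val withIndex = rdd.zipWithIndex()   // ("a",0), ("b",1), ("c",2)
val byIndex = withIndex.map(_.swap)  // key each record by its index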
Anant
---
Ashutosh,
The counter will certainly be a parallelization issue when multiple nodes
are used, especially over massive datasets.
A better approach would be to use something along these lines:
val index = sc.parallelize(Range.Long(0, rdd.count, 1),
rdd.partitions.size)
val rddWithIndex = rdd.zip(index)
> > We should take a vector instead, giving the user flexibility to decide
> > the data source/type.
>
> What do you mean by the vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
> <https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala>
>
> --
Ashutosh,
A vector would be a good idea; vectors are used very frequently.
Test data is usually stored in the spark/data/mllib folder.
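For reference, a minimal sketch of the MLlib vector types in question
(org.apache.spark.mllib.linalg):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))  // same values, sparse form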
On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" wrote:
> Hi Anant,
> sorry for my late reply. Than
Ashu,
There is one main issue and a few stylistic/grammatical things I noticed.
1> You take an RDD of type String which you expect to be comma separated.
This limits usability, since the user will have to convert their RDD to
that format only for you to split it on commas.
It would make more sense to accept vectors instead.
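To make the point concrete, a hypothetical sketch of the two API shapes
(method names are illustrative only, not the actual OutlierWithAVFModel
API):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// current shape: callers must pre-serialize rows into comma-separated strings
def outliersFromStrings(data: RDD[String]): RDD[String] = ???

// suggested shape: accept vectors directly, no string round-trip
def outliersFromVectors(data: RDD[Vector]): RDD[Vector] = ???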