Re: spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Feynman Liang
What is the type of unlabeledTest? SQL should be using the VectorUDT we've defined for Vectors so you should be able to just "import sqlContext.implicits._" and then call

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
n > to the cached results. Let's say if we implemented an "equal" method for > param1. By comparing param1 with the previous run, the program will know > data1 is reusable. And time used for generating data1 can be saved. > > Best, > Lewis > > 2015-09-15 23:05

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
Say if we run "searchRun()" for 2 times with the same "param1" but > different "param2". Will spark recognize that the two local variables > "data1" in consecutive runs has the same content? > > > Best, > Lewis > > 2015-09-15 13:58 GMT+08

Re: How to speed up MLlib LDA?

2015-09-15 Thread Feynman Liang
Hi Marko, I haven't looked into your case in much detail but one immediate thought is: have you tried the OnlineLDAOptimizer? Its implementation and the resulting LDA model (LocalLDAModel) are quite different (it doesn't depend on GraphX and assumes the model fits on a single machine), so you may see

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis, Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages `transform` methods individually (instead of using the Pipeline.setStages API). See associated JIRAs . Pipeline

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
> transformations, not transformer themselves. Do you know any related dev. > activities or plans? > > Best, > Lewis > > 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>: > >> Lewis, >> >> Many pipeline stages implement save/load method

Re: Training the MultilayerPerceptronClassifier

2015-09-13 Thread Feynman Liang
; Registered number: 02675207. > Registered address: Globe House, Clivemont Road, Maidenhead, Berkshire SL6 > 7DY, UK. > -- > *From:* Feynman Liang [fli...@databricks.com] > *Sent:* 11 September 2015 20:34 > *To:* Rory Waite > *Cc:* user@spark.apache.org > *Subject:* Re: Training t

Re: Model summary for linear and logistic regression.

2015-09-11 Thread Feynman Liang
Sorry! The documentation is not the greatest thing in the world, but these features are documented here On Fri, Sep 11, 2015 at 6:25 AM, Sebastian Kuepers < sebastian.kuep...@publicispixelpark.de> wrote: > Hey, > > > the 1.5.0 release

Re: Training the MultilayerPerceptronClassifier

2015-09-11 Thread Feynman Liang
Rory, I just sent a PR (https://github.com/avulanov/ann-benchmark/pull/1) to bring that benchmark up to date. Hope it helps. On Fri, Sep 11, 2015 at 6:39 AM, Rory Waite wrote: > Hi, > > I’ve been trying to train the new MultilayerPerceptronClassifier in spark > 1.5 for the

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Feynman Liang
Spark notebook does something similar, take a look at their line chart code On Fri, Sep 11, 2015 at 8:56 AM, Shashi Vishwakarma < shashi.vish...@gmail.com> wrote: > Hi > > I have

Re: Spark ANN

2015-09-09 Thread Feynman Liang
ered as well http://arxiv.org/pdf/1312.5851.pdf. > > > > *From:* Feynman Liang [mailto:fli...@databricks.com] > *Sent:* Tuesday, September 08, 2015 12:07 PM > *To:* Ulanov, Alexander > *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN &g

Re: Spark ANN

2015-09-08 Thread Feynman Liang
r > > *From:* Ruslan Dautkhanov [mailto:dautkha...@gmail.com] > *Sent:* Monday, September 07, 2015 10:09 PM > *To:* Feynman Liang > *Cc:* Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > Found a dropout commit from avulanov: > > > https://g

Re: Spark ANN

2015-09-07 Thread Feynman Liang
stava14a.pdf > https://cs.nyu.edu/~wanli/dropc/dropc.pdf > > ps. There is a small copy-paste typo in > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 > should read B :) > > > > -- > Ruslan Dautkhanov > &g

Re: Spark ANN

2015-09-07 Thread Feynman Liang
BTW thanks for pointing out the typos, I've included them in my MLP cleanup PR <https://github.com/apache/spark/pull/8648> On Mon, Sep 7, 2015 at 7:34 PM, Feynman Liang <fli...@databricks.com> wrote: > Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is o

Re: Spark ANN

2015-09-07 Thread Feynman Liang
Backprop is used to compute the gradient here, which is then optimized by SGD or LBFGS here

Re: How to generate spark assembly (jar file) using Intellij

2015-08-29 Thread Feynman Liang
Have you tried `build/sbt assembly`? On Sat, Aug 29, 2015 at 9:03 PM, Muler mulugeta.abe...@gmail.com wrote: Hi guys, I can successfully build Spark using Intellij, but I'm not able to locate/generate spark assembly (jar file) in the assembly/target directory. How do I generate one? I have

Re: Spark MLLIB multiclass classification

2015-08-29 Thread Feynman Liang
I would check out the Pipeline code example https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net wrote: Hi! I want to implement a multiclass classification for documents. So I have different kinds of text

Re: Spark MLLIB multiclass classification

2015-08-29 Thread Feynman Liang
(1L, "b d", 0.0), new LabeledDocument(2L, "hadoop f g h", 2.0), On Sun, Aug 30, 2015 at 7:32 AM, Feynman Liang fli...@databricks.com wrote: I would check out the Pipeline code example https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline On Sat, Aug 29, 2015 at 9:23 PM

Re: MLlib Prefixspan implementation

2015-08-26 Thread Feynman Liang
in inverse order just to be reversed when transformed to a sequence. 2015-08-25 12:15 GMT+08:00 Feynman Liang fli...@databricks.com: CCing the mailing list again. It's currently not on the radar. Do you have a use case for it? I can bring it up during 1.6 roadmap planning tomorrow. On Mon, Aug

Re: CHAID Decision Trees

2015-08-25 Thread Feynman Liang
jatinpr...@gmail.com wrote: Hi Feynman, Thanks for the information. Is there a way to depict decision tree as a visualization for large amounts of data using any other technique/library? Thanks, Jatin On Tue, Aug 25, 2015 at 11:42 PM, Feynman Liang fli...@databricks.com wrote: Nothing

Re: CHAID Decision Trees

2015-08-25 Thread Feynman Liang
Nothing is in JIRA https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22 so AFAIK no; only random forests and GBTs using entropy or Gini for information gain are supported. On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Feynman Liang
Kristina, Thanks for the discussion. I followed up on your problem and learned that Scala doesn't support multiple implicit conversions in a single expression http://stackoverflow.com/questions/8068346/can-scala-apply-multiple-implicit-conversions-in-one-expression for complexity reasons. I'm

Re: MLlib Prefixspan implementation

2015-08-24 Thread Feynman Liang
and not in the code so I guess you didn't use this result. Do you plan to implement sequence with timestamp and gap constraint as in : https://people.mpi-inf.mpg.de/~rgemulla/publications/miliaraki13mg-fsm.pdf 2015-08-25 7:06 GMT+08:00 Feynman Liang fli...@databricks.com: Hi Alexis

Re: mllib on (key, Iterable[Vector])

2015-08-11 Thread Feynman Liang
You could try flatMapping, i.e. if you have data: RDD[(key, Iterable[Vector])] then data.flatMap(_._2): RDD[Vector], which can be GMMed. If you want to first partition by url, I would first create multiple RDDs using `filter`, then run GMM on each of the filtered RDDs. On Tue, Aug 11, 2015
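
The flatMap shape described above can be sketched on plain Scala collections, which share these combinators with RDDs; the keys (URL strings) and the numbers below are made up for illustration, with Seq[Double] standing in for mllib's Vector.

```scala
// Plain-Scala stand-in for RDD[(String, Iterable[Vector])]; the same
// flatMap/filter calls apply to the distributed version. Data are made up.
val data: Seq[(String, Iterable[Seq[Double]])] = Seq(
  ("http://a.example", Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))),
  ("http://b.example", Seq(Seq(5.0, 6.0)))
)

// Drop the keys and flatten every Iterable into one flat collection of
// vectors, suitable for fitting a single GMM over all the data:
val allVectors = data.flatMap(_._2)

// Or, to fit one GMM per key, first restrict to a single key with filter:
val aVectors = data.filter(_._1 == "http://a.example").flatMap(_._2)
```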

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Sounds reasonable to me, feel free to create a JIRA (and PR if you're up for it) so we can see what others think! On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler gerald.loeff...@googlemail.com wrote: hi, if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0, doesn’t that make it

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
. So with fraction=1, then all the rows will be sampled exactly once to form the miniBatch, resulting to the deterministic/classical case. On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang fli...@databricks.com wrote: Sounds reasonable to me, feel free to create a JIRA (and PR if you're up

Re: Spark MLib v/s SparkR

2015-08-07 Thread Feynman Liang
SparkR and MLlib are becoming more integrated (we recently added R formula support) but the integration is still quite small. If you learn R and SparkR, you will not be able to leverage most of the distributed algorithms in MLlib (e.g. all the algorithms you cited). However, you could use the

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
:24 AM, Feynman Liang fli...@databricks.com wrote: Yep, I think that's what Gerald is saying and they are proposing to default miniBatchFraction = (1 / numInstances). Is that correct? On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu rotationsymmetr...@gmail.com wrote: I think in the SGD

Re: Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
Hi Hayri, Can you provide a sample of the expected and actual results? Feynman On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun volkana...@gmail.com wrote: The results in MulticlassMetrics is totally wrong. They are improperly calculated. Confusion matrix may be true I don't know but for

Re: Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
Also, what version of Spark are you using? On Wed, Aug 5, 2015 at 9:57 AM, Feynman Liang fli...@databricks.com wrote: Hi Hayri, Can you provide a sample of the expected and actual results? Feynman On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun volkana...@gmail.com wrote: The results

Re: Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
code... I attached a document for my reuters tests on page 3. On Wed, Aug 5, 2015 at 7:57 PM, Feynman Liang fli...@databricks.com wrote: Also, what version of Spark are you using? On Wed, Aug 5, 2015 at 9:57 AM, Feynman Liang fli...@databricks.com wrote: Hi Hayri, Can you provide

Re: [Spark ML] HasInputCol, etc.

2015-07-28 Thread Feynman Liang
Unfortunately, AFAIK custom transformers are not part of the public API so you will have to continue with what you're doing. On Tue, Jul 28, 2015 at 1:32 PM, Matt Narrell matt.narr...@gmail.com wrote: Hey, Our ML ETL pipeline has several complex steps that I’d like to address with custom

Re: LDA on a large dataset

2015-07-20 Thread Feynman Liang
LDAOptimizer.scala:421 collects to the driver a numTopics-by-vocabSize matrix of summary statistics. I suspect that this is what's causing the failure. One thing you may try is decreasing the vocabulary size. One possibility would be to use a HashingTF if you don't mind dimension reduction via
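
The dimension reduction HashingTF performs rests on the hashing trick; the sketch below is a simplified pure-Scala illustration of the idea (not Spark's actual implementation), showing how it caps the feature space at numFeatures regardless of how many distinct terms appear.

```scala
// Simplified sketch of the hashing trick behind HashingTF (not Spark's
// actual implementation): every term is hashed into a fixed index space of
// size numFeatures, bounding the vocabulary at the cost of possible hash
// collisions between distinct terms.
def hashIndex(term: String, numFeatures: Int): Int = {
  val raw = term.hashCode % numFeatures
  if (raw < 0) raw + numFeatures else raw // keep indices in [0, numFeatures)
}

// Term-frequency map over the hashed index space for one document.
def termFrequencies(doc: Seq[String], numFeatures: Int): Map[Int, Int] =
  doc.groupBy(hashIndex(_, numFeatures)).map { case (i, ts) => (i, ts.size) }

val tf = termFrequencies(Seq("spark", "lda", "spark"), 16)
```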

Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
. Given the above two cases, is using MultivariateOnlineSummarizer not a good idea then? Anupam Bagchi On Jul 13, 2015, at 7:06 PM, Feynman Liang fli...@databricks.com wrote: Dimensions mismatch when adding new sample. Expecting 8 but got 14. Make sure all the vectors you are summarizing over

Re: Few basic spark questions

2015-07-14 Thread Feynman Liang
, 2015 at 12:35 AM, Feynman Liang fli...@databricks.com wrote: Sorry; I think I may have used poor wording. SparkR will let you use R to analyze the data, but it has to be loaded into memory using SparkR (see SparkR DataSources http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
saveAsTextFile() on it. Anupam Bagchi (c) 408.431.0780 (h) 408-873-7909 On Jul 13, 2015, at 4:52 PM, Feynman Liang fli...@databricks.com wrote: A good example is RegressionMetrics https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation

Re: Few basic spark questions

2015-07-13 Thread Feynman Liang
Hi Oded, I'm not sure I completely understand your question, but it sounds like you could have the READER receiver produce a DStream which is windowed/processed in Spark Streaming, with foreachRDD to do the OUTPUT. However, streaming in SparkR is not currently supported (SPARK-6803

Re: Spark issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread Feynman Liang
Can you send the error messages again? I'm not seeing them. On Mon, Jul 13, 2015 at 2:45 AM, shivamverma shivam13ve...@gmail.com wrote: Hi I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS node. I am trying to run grid search on an RF classifier to classify a small

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
The call to Sorting.quicksort is not working. Perhaps I am calling it the wrong way. allaggregates.toArray allocates and creates a new array, separate from allaggregates, and it is that new array which Sorting.quickSort sorts in place; allaggregates itself stays unsorted. Try: val sortedAggregates = allaggregates.toArray
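
The fix can be sketched in plain Scala (the data below are made up): Sorting.quickSort mutates the array it is given, so sort the copy that toArray returns and read the result from that copy.

```scala
import scala.util.Sorting

// Made-up stand-in for the user's allaggregates collection.
val allaggregates = Seq(3.0, 1.0, 2.0)

val sortedAggregates = allaggregates.toArray // a new, separate array
Sorting.quickSort(sortedAggregates)          // sorts that array in place
// allaggregates is untouched; sortedAggregates now holds the sorted values
```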

Re: Few basic spark questions

2015-07-13 Thread Feynman Liang
guess not because it is a streaming command, right? Any other way to window the data? Sent from IPhone On Mon, Jul 13, 2015 at 2:07 PM -0700, Feynman Liang fli...@databricks.com wrote: If you use SparkR then you can analyze the data that's currently in memory with R; otherwise you

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
me how to write the ‘foreach’ loop in a Spark-friendly way? Thanks a lot for your help. Anupam Bagchi On Jul 13, 2015, at 12:21 PM, Feynman Liang fli...@databricks.com wrote: The call to Sorting.quicksort is not working. Perhaps I am calling it the wrong way. allaggregates.toArray

Re: How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread Feynman Liang
This might be a bug... R^2 should always be in [0,1] and variance should never be negative. Can you give more details on which version of Spark you are running? On Sun, Jul 12, 2015 at 8:37 AM, Sean Owen so...@cloudera.com wrote: In general, R2 means the line that was fit is a very poor fit --
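
For context, the quantity under discussion is R² = 1 − SS_res/SS_tot. The pure-Scala sketch below (made-up numbers, not Spark's RegressionMetrics code) shows it is near 1 for a close fit but drops below zero for a fit worse than predicting the mean, consistent with Sean Owen's reading.

```scala
// R^2 = 1 - SS_res / SS_tot. It is guaranteed non-negative only for a
// least-squares fit scored on its own training data; a model worse than
// predicting the label mean produces a negative value. Numbers are made up.
def r2(labels: Seq[Double], predictions: Seq[Double]): Double = {
  val mean = labels.sum / labels.length
  val ssTot = labels.map(y => math.pow(y - mean, 2)).sum
  val ssRes = labels.zip(predictions)
    .map { case (y, p) => math.pow(y - p, 2) }.sum
  1.0 - ssRes / ssTot
}

val labels = Seq(1.0, 2.0, 3.0)
val goodR2 = r2(labels, Seq(1.1, 2.0, 2.9)) // close fit: near 1
val badR2  = r2(labels, Seq(3.0, 3.0, 3.0)) // worse than the mean: negative
```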

Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread Feynman Liang
An RDD[Double] is an abstraction for a large collection of doubles, possibly distributed across multiple nodes. The DoubleRDDFunctions are there for performing mean and variance calculations across this distributed dataset. In contrast, a Vector is not distributed and fits on your local machine.
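
The statistics DoubleRDDFunctions provides are ordinary summaries; a plain-Scala sketch with made-up numbers shows the local equivalent, which is all a non-distributed Vector needs (a Seq stands in for the RDD here).

```scala
// Local equivalent of the mean/variance that DoubleRDDFunctions computes
// over a distributed RDD[Double]. Like Spark's variance(), this is the
// population variance (divided by n, not n - 1). Numbers are made up.
val values = Seq(1.0, 2.0, 3.0, 4.0)
val mean = values.sum / values.length
val variance = values.map(v => math.pow(v - mean, 2)).sum / values.length
```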

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Feynman Liang
Take a look at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse On Wed, Jul 8, 2015 at 7:47 AM, Daniel Siegmann daniel.siegm...@teamaol.com wrote: To set up Eclipse for Spark you should install the Scala IDE plugins:

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Feynman Liang
I was thinking the same thing! Try sc.setLogLevel("ERROR") On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson lat...@microsoft.com wrote: “WARN Executor: Told to re-register on heartbeat” is logged repeatedly in the spark shell, which is very distracting and corrupts the display of whatever set

Re: How to write mapreduce programming in spark by using java on user-defined javaPairRDD?

2015-07-07 Thread Feynman Liang
Hi Missie, In the Java API, you should consider: 1. RDD.map https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#map(scala.Function1,%20scala.reflect.ClassTag) to transform the text 2. RDD.sortBy

Re: Random Forest in MLLib

2015-07-06 Thread Feynman Liang
Not yet, though work on this feature has begun (SPARK-5133 https://issues.apache.org/jira/browse/SPARK-5133) On Mon, Jul 6, 2015 at 4:46 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, Is there a way to get variable importance for RandomForest model created using MLLib ? This way

Re: KMeans questions

2015-07-02 Thread Feynman Liang
SPARK-7879 https://issues.apache.org/jira/browse/SPARK-7879 seems to address your use case (running KMeans on a dataframe and having the results added as an additional column) On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman eric.d.fried...@gmail.com wrote: In preparing a DataFrame (spark 1.4) to

Re: sliding

2015-07-02 Thread Feynman Liang
= Event(s(0).time, (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0)) and that is working. Anyway this is not what I wanted to do; my goal was more to implement bucketing to shorten the time series. On 2 July 2015 at 18:25, Feynman Liang fli...@databricks.com wrote: What's

Re: sliding

2015-07-02 Thread Feynman Liang
, Feynman Liang fli...@databricks.com wrote: How about: events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0) That would group the RDD into adjacent buckets of size 3. On Thu, Jul 2, 2015 at 2:33 PM, tog guillaume.all...@gmail.com wrote: Was complaining about the Seq ... Moved it to val
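
The suggested chain behaves identically on plain Scala collections, which makes the bucketing easy to check; on an RDD, sliding comes from mllib's RDDFunctions (worth verifying for your Spark version).

```scala
// sliding(3) produces every length-3 window; keeping only every third
// window (index % 3 == 0) yields adjacent, non-overlapping buckets of 3.
// Plain Scala here; the same chain is what the RDD version expresses.
val events = (1 to 9).toList
val buckets = events.sliding(3).zipWithIndex
  .collect { case (window, i) if i % 3 == 0 => window }
  .toList
```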

Re: custom RDD in java

2015-07-01 Thread Feynman Liang
saveAsTextFile on customRDD directly to save in hdfs. On Thu, Jul 2, 2015 at 12:59 AM, Feynman Liang fli...@databricks.com wrote: On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora shushantaror...@gmail.com wrote: JavaRDDString rdd = javasparkcontext.parllelise(tables); You are already

Re: required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Feynman Liang
You are trying to predict on a DStream[LabeledPoint] (data + labels) but predictOn expects a DStream[Vector] (just the data without the labels). Try doing: val unlabeledStream = labeledStream.map { x => x.features } model.predictOn(unlabeledStream).print() On Sun, Jun 28, 2015 at 6:03 PM, Arthur