Re: Hadoop LR comparison

2014-04-01 Thread DB Tsai
soon. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, Is the code available for Hadoop to calculate the Logistic Regression hyperplane

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
with Spark in SF Machine Learning Meetup, please let me know. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong, This is pretty much what I did. I ran into the same issue you had. Since I'm not developing YARN-related code, I just excluded those two YARN-related projects from IntelliJ, and it works. PS: you may need to exclude the java8 project as well now. Sincerely, DB Tsai

Re: skip lines in spark

2014-04-23 Thread DB Tsai
What I suggested will not work if the number of records you want to drop is larger than the number of records in the first partition. In my use case, I only drop the first couple of lines, so I don't have this issue. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running out of memory Naive Bayes

2014-04-26 Thread DB Tsai
Which version of MLlib are you using? For Spark 1.0, MLlib will support sparse feature vectors, which will improve performance a lot when computing the distance between points and centroids. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: is it okay to reuse objects across RDD's?

2014-04-27 Thread DB Tsai
tolerance and data splitting to disk. It will be nice to have an API that we can do this type of book-keeping with native support. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Apr

Re: Running out of memory Naive Bayes

2014-04-28 Thread DB Tsai
One year ago, our customer asked us to implement Naive Bayes that could at least train on news20, and we implemented it for them in Hadoop, using the distributed cache to store the model. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread DB Tsai
Your code is unformatted. Can you paste the whole file in a gist so I can take a look for you? On Apr 28, 2014 10:42 PM, Earthson earthson...@gmail.com wrote: I've moved SparkContext and RDD as parameter of train. And now it tells me that SparkContext need to serialize! I think the problem is

Re: string to int conversion

2014-05-02 Thread DB Tsai
You can drop the header in a csv by rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => { if (partitionIdx == 0) lines.drop(1) else lines }) On May 2, 2014 6:02 PM, SK skrishna...@gmail.com wrote: 1) I have a csv file where one of the fields has integer data but it
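The header-dropping idea can be tried without a cluster. Below is a minimal sketch that uses a plain Seq of Seqs to stand in for RDD partitions; `HeaderDrop`, `dropHeader`, and `simulate` are hypothetical names introduced only for illustration. It returns the result of `drop` instead of the original iterator, because a Scala iterator should not be reused after `drop` has been called on it.

```scala
object HeaderDrop {
  // Same shape as the function passed to mapPartitionsWithIndex:
  // (partition index, iterator over that partition) => new iterator.
  def dropHeader(partitionIdx: Int, lines: Iterator[String]): Iterator[String] =
    if (partitionIdx == 0) lines.drop(1) else lines

  // Hypothetical stand-in for an RDD: each inner Seq plays one partition.
  def simulate(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      dropHeader(idx, part.iterator).toList
    }
}
```

As noted earlier in the "skip lines in spark" thread, this only removes lines that live in the first partition, so it suits a one-line header but not an arbitrary number of leading records.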

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread DB Tsai
breeze jar in the spark flat assembly jar, so you don't need to add breeze dependency yourself. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 4, 2014 at 4:07 AM, wxhsdp wxh

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-05 Thread DB Tsai
jar api like Yadid said. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 4, 2014 at 10:24 PM, wxhsdp wxh...@gmail.com wrote: Hi, DB, i think it's something related to sbt

Re: Spark LIBLINEAR

2014-05-12 Thread DB Tsai
It seems that the code isn't managed on GitHub. It can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip It would be easier to track the changes on GitHub. Sincerely, DB Tsai

Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
the loop have to be serializable. Since the for-loop is serializable in scala, I guess you have something non-serializable inside the for-loop. The while-loop in scala is native, so you won't have this issue if you use while-loop. Sincerely, DB Tsai

Re: Turn BLAS on MacOSX

2014-05-13 Thread DB Tsai
Hi wxhsdp, See https://github.com/scalanlp/breeze/issues/142 and https://github.com/fommil/netlib-java/issues/60 for details. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread DB Tsai
tomorrow. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13, 2014 at 11:41 PM, Xiangrui Meng men...@gmail.com wrote: I don't know whether this would fix the problem

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread DB Tsai
honor it. I'm trying to figure out the problem now. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 14, 2014 at 5:46 AM, wxhsdp wxh...@gmail.com wrote: Hi, DB i've add breeze

Calling external classes added by sc.addJar needs to be through reflection

2014-05-16 Thread DB Tsai
will not be seen. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
in JVM. Users can use the classes directly. https://github.com/dbtsai/classloader-experiement/blob/master/calling/src/main/java/Calling3.java I'm now porting example 3) to Spark, and will let you know if it works. Thanks. Sincerely, DB Tsai

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
The jars are actually there (and in classpath), but you need to load through reflection. I've another thread giving the workaround. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
classNotFound exception. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 14, 2014 at 6:04 PM, Xiangrui Meng men...@gmail.com wrote: In SparkContext#addJar, for yarn

Re: Passing runtime config to workers?

2014-05-18 Thread DB Tsai
not serializable, it will raise exception. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 18, 2014 at 12:58 PM, Robert James srobertja...@gmail.com wrote: I see - I didn't realize that scope

Re: Logistic Regression MLLib Slow

2014-06-05 Thread DB Tsai
/latest/mllib-optimization.html for detail. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng men...@gmail.com wrote: Hi Krishna, Specifying executor

Re: Logistic Regression MLLib Slow

2014-06-05 Thread DB Tsai
Hi Krishna, It should work, and we use it in production with great success. However, the constructor of LogisticRegressionModel is private[mllib], so you have to write your code in a source file whose package is under org.apache.spark.mllib instead of using the Scala console. Sincerely, DB Tsai

Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan, You can check out the unittest code of GradientDescent.runMiniBatchSGD https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala Sincerely, DB Tsai --- My Blog

Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
tracker for each operation will be very expensive. Is there a way to disable this behavior? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
What if there are multiple threads using the same Spark context; will each thread have its own UI? In this case, it will quickly run out of ports. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-10 Thread DB Tsai
Hi Nick, How does reduce work? I thought after reducing in the executor, it will reduce in parallel between multiple executors instead of pulling everything to driver and reducing there. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Normalizations in MLBase

2014-06-11 Thread DB Tsai
Hi Aslan, Currently, we don't have a utility function to do so. However, you can easily implement this with another map transformation. I'm working on this feature now, and there will be a couple of different normalization options users can choose from. Sincerely, DB Tsai
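A z-score normalization of the kind discussed in this thread can be sketched in plain Scala. This is only an illustration under stated assumptions: `ZNorm` and `zscore` are hypothetical names, the data is a local Seq and not an RDD, and the population standard deviation (dividing by n) is used. The final `rows.map` step corresponds to the "another map transformation" mentioned above, applied once the column means and standard deviations are known.

```scala
object ZNorm {
  // Standardize each column to zero mean and unit variance.
  // Columns with zero variance are only centered, to avoid dividing by zero.
  def zscore(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.size.toDouble
    val dims = rows.head.length
    // Column means.
    val means = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
    // Population standard deviation per column (an assumption of this sketch).
    val stds = Array.tabulate(dims) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / n)
    }
    // The map step: rewrite every row using the precomputed statistics.
    rows.map { r =>
      Array.tabulate(dims) { j =>
        val centered = r(j) - means(j)
        if (stds(j) == 0.0) centered else centered / stds(j)
      }
    }
  }
}
```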

Re: Normalizations in MLBase

2014-06-12 Thread DB Tsai
at 11:13 AM, Aslan Bekirov aslanbeki...@gmail.com wrote: Thanks a lot DB. I will try to do Znorm normalization using map transformation. BR, Aslan On Thu, Jun 12, 2014 at 12:16 AM, DB Tsai dbt...@stanford.edu wrote: Hi Aslan, Currently, we don't have the utility function to do so

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread DB Tsai
Hi Congrui, Since it's private in the mllib package, one workaround is to write your code in a Scala file under the mllib package in order to use the constructor of LogisticRegressionModel. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: pyspark regression results way off

2014-06-16 Thread DB Tsai
Is your data normalized? Sometimes, GD doesn't work well if the data has a wide range. If you are willing to write Scala code, you can try the LBFGS optimizer, which converges better than GD. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread DB Tsai
Hi Congrui, We're working on weighted regularization, so for intercept, you can just set it as 0. It's also useful when the data is normalized but want to solve the regularization with original data. Sincerely, DB Tsai --- My Blog: https

Re: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread DB Tsai
Hi Congrui, I mean create your own TrainMLOR.scala with all the code provided in the example, and have it under package org.apache.spark.mllib Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, What's the difference between treeAggregate and aggregate? Why does treeAggregate scale better? What if we just use mapPartition; will it be as fast as treeAggregate? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, Does it mean that mapPartition followed by reduce has the same behavior as the aggregate operation, which is O(n)? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jun 17

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
)) } System.setProperty(SPARK_YARN_MODE, true) val sparkConf = new SparkConf val args = getArgsFromConf(conf) new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run Sincerely, DB Tsai --- My Blog: https

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
, etc. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers ko...@tresata.com wrote: db tsai, if in yarn-cluster mode the driver runs inside yarn, how

Re: pyspark regression results way off

2014-06-25 Thread DB Tsai
There is no python binding for LBFGS. Feel free to submit a PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi mohitja...@gmail.com wrote

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-05 Thread DB Tsai
You may try LBFGS to have more stable convergence. In spark 1.1, we will be able to use LBFGS instead of GD in training process. On Jul 4, 2014 1:23 PM, Thomas Robert tho...@creativedata.fr wrote: Hi all, I too am having some issues with *RegressionWithSGD algorithms. Concerning your issue

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the mode that requires installing the jar on each individual node is standalone mode, which works for both MR1 and MR2. Cloudera and Hortonworks currently support Spark in this way as far as I know. For both yarn-cluster and yarn-client, Spark will distribute the jars through distributed cache

Re: usage question for saprk run on YARN

2014-07-07 Thread DB Tsai
yarn-client mode runs the driver in your application's JVM, while yarn-cluster mode runs the driver in the YARN cluster. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jul 7, 2014 at 5:44 PM

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
not be straightforward by just changing the version in spark build script. Jetty 9.x requires Java 7 since the servlet api (servlet 3.1) requires Java 7. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from the latest development branch of the git repository. On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by master? Because I am using v 1.0.0 - Alex -- View this message in

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
Are you using 1.0 or current master? A bug related to this is fixed in master. On Jul 12, 2014 8:50 AM, Srikrishna S srikrishna...@gmail.com wrote: I am running logistic regression with SGD on a problem with about 19M parameters (the kdda dataset from the libsvm library) I consistently see that

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
https://issues.apache.org/jira/browse/SPARK-2156 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S srikrishna...@gmail.com wrote: I am using

Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread DB Tsai
, and sparse data is supported. It will be interesting to see new benchmark result. Anyone familiar with BIDMach? Are they as fast as they claim? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-02 Thread DB Tsai
I ran into this issue as well. The workaround suggested by Shivaram, copying the jar and ivy files manually, works for me. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 1, 2014 at 3:31

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-10 Thread DB Tsai
Spark caches the RDD in the JVM, so presumably, yes, the singleton trick should work. Sent from my Google Nexus 5 On Aug 9, 2014 11:00 AM, Kevin James Matzen kmat...@cs.cornell.edu wrote: I have a related question. With Hadoop, I would do the same thing for non-serializable objects and setup().
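The "singleton trick" mentioned in this thread can be illustrated with plain Scala. This is a sketch under assumptions: `ExpensiveResource`, `ResourceHolder`, and `SingletonDemo` are hypothetical names, and the nested Seq only simulates several tasks running in one JVM. The point is that a Scala `object` is initialized once per JVM, so a non-serializable value held in it is created where it is used and never has to be shipped inside a closure.

```scala
// Hypothetical stand-in for a resource that cannot be serialized
// (for example a native handle or a database connection).
class ExpensiveResource {
  ExpensiveResource.created += 1
  def lookup(x: Int): Int = x * 2
}

object ExpensiveResource {
  // Counts constructions so the once-per-JVM behavior is visible.
  var created: Int = 0
}

object ResourceHolder {
  // A Scala object is initialized once per JVM; the lazy val defers
  // construction until some task first touches it.
  lazy val resource: ExpensiveResource = new ExpensiveResource
}

object SingletonDemo {
  // Each inner Seq plays the role of one task's partition in the same JVM.
  def run(partitions: Seq[Seq[Int]]): Seq[Int] =
    partitions.flatMap(_.map(x => ResourceHolder.resource.lookup(x)))
}
```

In this sketch all simulated tasks share one `ExpensiveResource` instance; whether that sharing is safe in a real job depends on the resource being thread-safe, which is not shown here.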

Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
and there, so we're looking forward to your feedback, and please let us know what you think. We'll continue to improve it and we'll be adding Gradient Boosting in the near future as well. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Cui You can take a look at multinomial logistic regression PR I created. https://github.com/apache/spark/pull/1379 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Debasish, I didn't compare one-vs-all with softmax regression. One issue is that for one-vs-all, we have to train k classifiers for a k-class problem. The training time will be k times longer. Sincerely, DB Tsai --- My Blog: https

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread DB Tsai
we have internal version requiring some cleanup for open source project. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3, 2014 at 7:34 PM, Xiangrui Meng men...@gmail.com wrote

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-04 Thread DB Tsai
To save memory, I recommend you compress the cached RDD; it will be a couple of times smaller than the original data set. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-06 Thread DB Tsai
Yes. But you need to store the RDD as *serialized* Java objects. See the section on storage levels at http://spark.apache.org/docs/latest/programming-guide.html Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread DB Tsai
. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Sep 13, 2014 at 2:12 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi All, I found that LogisticRegressionWithLBFGS interface

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread DB Tsai
by multiplying the weights by a constant. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used

Re: Fwd: Breeze Library usage in Spark

2014-10-03 Thread DB Tsai
You don't have to include the breeze jar, which is already in the Spark assembly jar. The native one is optional. Sent from my Google Nexus 5 On Oct 3, 2014 8:04 PM, Priya Ch learnings.chitt...@gmail.com wrote: yes. I have included breeze-0.9 in build.sbt file. I'll change this to 0.7. Apart from

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-10-09 Thread DB Tsai
, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Sep 29, 2014 at 11:45 AM, Yanbo Liang yanboha...@gmail.com wrote: Thank you for all your patient response. I can conclude that if the data

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread DB Tsai
- 1 until rdds.length) { temp = temp.unionAll(rdds(i)) } temp } Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Oct 13, 2014 at 7:22 PM, Nicholas Chammas nicholas.cham

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread DB Tsai
I saw a similar bottleneck in the reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on the single executor reducing a particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: why fetch failed

2014-10-20 Thread DB Tsai
here https://github.com/cloudera/spark/tree/cdh5-1.1.0_5.2.0 PS, I haven't tested it yet, but will test it in the next couple of days, and report back. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: How to emit multiple keys for the same value?

2014-10-20 Thread DB Tsai
You can do this using flatMap with a function that returns a Seq of (key, value) pairs. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Oct 20, 2014 at 9:31 AM, HARIPRIYA AYYALASOMAYAJULA aharipriy
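The flatMap suggestion can be sketched on an ordinary Scala collection, where `flatMap` takes the same shape of function as it does on an RDD. `MultiKey`, `keysFor`, and `emit` are hypothetical names, and the rule for deriving keys (one key per distinct letter of a word) is invented purely for illustration.

```scala
object MultiKey {
  // Hypothetical key function: index a word under each of its distinct letters.
  def keysFor(word: String): Seq[Char] = word.toSeq.distinct

  // One input value fans out to one (key, value) pair per key.
  def emit(words: Seq[String]): Seq[(Char, String)] =
    words.flatMap(w => keysFor(w).map(k => (k, w)))
}
```

Here the value "ab" is emitted twice, once under key 'a' and once under key 'b', which is the multiple-keys-per-value behavior asked about in the thread.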

Shuffle issues in the current master

2014-10-22 Thread DB Tsai
) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Sincerely, DB Tsai

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
It seems that this issue should be addressed by https://github.com/apache/spark/pull/2890 ? Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Oct 22, 2014 at 11:54 AM, DB

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
Or can it be solved by setting both of the following settings to true for now? spark.shuffle.spill.compress true spark.shuffle.compress true Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
PS, sorry for spamming the mailing list. Based on my knowledge, both spark.shuffle.spill.compress and spark.shuffle.compress default to true, so in theory we should not run into this issue if we don't change any setting. Is there any other bug we are running into? Thanks. Sincerely, DB Tsai

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
We don't have SVMWithLBFGS, but you can check out how we implement LogisticRegressionWithLBFGS, and we also deal with some condition number improving stuff in LogisticRegressionWithLBFGS which improves the performance dramatically. Sincerely, DB Tsai

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
oh, we just train the model in the standardized space which will help the convergence of LBFGS. Then we convert the weights to original space so the whole thing is transparent to users. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
Yeah, column normalization. For some of the datasets, training will not converge without it. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Oct 24, 2014 at 3:46 PM, Debasish

Re: Shuffle issues in the current master

2014-10-25 Thread DB Tsai
Hi Andrew, We were running the master after SPARK-3613. We will give it another shot against the current master, since Josh fixed a couple of issues in shuffle. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: embedded spark for unit testing..

2014-11-09 Thread DB Tsai
/apache/spark/mllib/util/LocalSparkContext.scala as an example. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Nov 9, 2014 at 9:12 PM, Kevin Burton bur...@spinn3r.com wrote: What’s

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
I also worry that the author of JPMML changed the license of jpmml-evaluator in the interest of his commercial business, and that he might change the license of jpmml-model in the future. Sincerely, DB Tsai --- My Blog: https

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
Also, are you using the latest master in this experiment? A PR merged into master a couple of days ago speeds up k-means by a factor of three. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai

Re: Including data nucleus tools

2014-12-05 Thread DB Tsai
Can you try to run the same job using the assembly packaged by make-distribution as we discussed in the other thread. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014

Re: Why KMeans with mllib is so slow ?

2014-12-08 Thread DB Tsai
You just need to use the latest master code without any configuration to get performance improvement from my PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 8, 2014 at 7:53

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
You need to apply StandardScaler yourself to help convergence. LBFGS just takes whatever objective function you provide without doing any scaling. I would like to provide LinearRegressionWithLBFGS, which does the scaling internally, in the near future. Sincerely, DB Tsai

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
the coefficients to the original space from the scaled space, the intercept can be computed by w0 = y - \sum x_n w_n where x_n is the average of column n. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
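The intercept formula quoted in this message can be written as a small Scala function. This is a sketch under assumptions: since the snippet is truncated, `y` is taken here to mean the average of the response (an assumption, by analogy with x_n being a column average), the weights are assumed to be already expressed in the original space, and `Intercept` and `intercept` are hypothetical names.

```scala
object Intercept {
  // w0 = mean(y) - sum_n mean(x_n) * w_n
  // yMean:   average of the response (assumed meaning of y in the formula)
  // xMeans:  average of each feature column
  // weights: coefficients already converted to the original space
  def intercept(yMean: Double, xMeans: Array[Double], weights: Array[Double]): Double = {
    require(xMeans.length == weights.length, "one column mean per weight")
    yMean - xMeans.zip(weights).map { case (m, w) => m * w }.sum
  }
}
```

For example, with a response mean of 10, column means (1, 2) and weights (3, 2), the function evaluates 10 - (1*3 + 2*2).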

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
= scalerWithResponse.transform(rddVector).map(x => { (x(x.size - 1), Vectors.dense(x.toArray.slice(0, x.size - 1))) }) Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 12, 2014 at 12:23

Re: ERROR YarnClientClusterScheduler: Lost executor Akka client disassociated

2014-12-15 Thread DB Tsai
want to break down which part of your code causes the issue to make debugging easier. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Dec 11, 2014 at 4:48 AM, Muhammad Ahsan muhammad.ah

Re: Including data nucleus tools

2014-12-15 Thread DB Tsai
Just out of curiosity: did you manually apply this patch to see if it actually resolves the issue? It seems that it was merged at some point, but reverted because it caused some stability issues. Sincerely, DB Tsai --- My Blog: https

Re: Effects problems in logistic regression

2014-12-22 Thread DB Tsai
Sounds great. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: Thanks again DB Tsai

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Fri, Mar 13, 2015 at 2:41 PM, cjwang c...@cjwang.us wrote: I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR

Re: LBGFS optimizer performace

2015-03-05 Thread DB Tsai
PS, I recommend you compress the data when you cache the RDD. There will be some overhead in compression/decompression and serialization/deserialization, but it will help a lot for iterative algorithms because more data can be cached. Sincerely, DB Tsai

Re: How to deploy binary dependencies to workers?

2015-03-24 Thread DB Tsai
I would recommend uploading those jars to HDFS, and using the --jars option in spark-submit with a URI from HDFS instead of a URI from the local filesystem. This avoids the problem of fetching jars from the driver, which can be a bottleneck. Sincerely, DB Tsai

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
it will cause problem for the algorithm. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote: Hello, I am new to spark streaming API. I wanted to ask if I can apply LBFGS

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-25 Thread DB Tsai
We fixed a couple of issues in the Breeze LBFGS implementation. Can you try Spark 1.3 and see if they still exist? Thanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 12:48 PM, Chang-Jia Wang c...@cjwang.us wrote: I

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread DB Tsai
Are you deploying the windows dll to linux machine? Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen davidshe...@gmail.com wrote: I think you meant to use the --files to deploy the DLLs. I gave

Re: foreachActive functionality

2015-01-25 Thread DB Tsai
PS, we were using Breeze's activeIterator originally, as you can see in the old code, but we found it had overhead, so we wrote our own implementation, which is 4x faster. See https://github.com/apache/spark/pull/3288 for details. Sincerely, DB Tsai

Re: Features scaling

2015-04-21 Thread DB Tsai
Hi Denys, I don't see any issue in your Python code, so maybe there is a bug in the Python wrapper. If it's in Scala, I think it should work. BTW, LogisticRegressionWithLBFGS does the standardization internally, so you don't need to do it yourself. It's worth giving it a try! Sincerely, DB Tsai

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread DB Tsai
the scaling and intercepts implicitly in objective function so no overhead of creating new transformed dataset. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Apr 29, 2015 at 1:21 AM, selim namsi selim.na...@gmail.com wrote: Thank

Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in the MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote: Hi all, I'm looking to implement a Multilabel

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread DB Tsai
Hi Xin, If you take a look at the model you trained, the intercept from Spark is significantly smaller than StatsModel, and the intercept represents a prior on categories in LOR which causes the low accuracy in Spark implementation. In LogisticRegressionWithLBFGS, the intercept is regularized due

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
? Thanks! On Thursday, June 4, 2015, DB Tsai dbt...@dbtsai.com wrote: By default, the depth of the tree is 2. Each partition will be one node. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Thu, Jun 4, 2015 at 10:46 AM

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
By default, the depth of the tree is 2. Each partition will be one node. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar raghav0110...@gmail.com wrote: Hey Reza, Thanks for your response
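The idea of a two-level reduction tree with one leaf per partition can be sketched in plain Scala. This is only an illustration of the general technique, not Spark's actual treeReduce implementation: `TreeReduceSketch` and the `fanIn` parameter are hypothetical, and the nested Seq stands in for partitions.

```scala
object TreeReduceSketch {
  // Two-level reduction:
  //   1. each partition is reduced locally (the leaves of the tree),
  //   2. partial results are merged in groups of `fanIn` (the intermediate level),
  //   3. only the few group results are merged at the end (the final combiner's share).
  def treeReduce[T](partitions: Seq[Seq[T]], fanIn: Int)(op: (T, T) => T): T = {
    require(fanIn >= 1, "fanIn must be positive")
    val perPartition = partitions.filter(_.nonEmpty).map(_.reduce(op))
    require(perPartition.nonEmpty, "nothing to reduce")
    val perGroup = perPartition.grouped(fanIn).map(_.reduce(op)).toList
    perGroup.reduce(op)
  }
}
```

With four partitions and a `fanIn` of 2, the final step merges two values instead of four, which is the reason a tree-shaped reduction puts less pressure on the final combiner than a flat one.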

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the # of executors you have. In Spark 1.4's new ML pipeline

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
. On Jun 3, 2015, at 9:53 PM, DB Tsai dbt...@dbtsai.com wrote: Which part of StandardScaler is slow? Fit or transform? Fit has shuffle but very small, and transform doesn't do shuffle. I guess you don't have enough partition, so please

Re: Implementing top() using treeReduce()

2015-06-09 Thread DB Tsai
} }.toArray.sorted(ord) } } } def treeTop(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { treeTakeOrdered(num)(ord.reverse) } Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D https://pgp.mit.edu

Re: Linear Regression with SGD

2015-06-09 Thread DB Tsai
As Robin suggested, you may try the following new implementation. https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef Thanks. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D https

Re: Missing values support in Mllib yet?

2015-06-19 Thread DB Tsai
Not really yet. But at work, we do missing-value imputation for GBDT, so I'm interested in porting it to MLlib if I have enough time. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Jun 19, 2015 at 1:23 PM
