Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-02 Thread DB Tsai
I ran into this issue as well. The workaround suggested by Shivaram, copying the jar and ivy manually, works for me. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 1, 2014 at 3:31

Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread DB Tsai
, and sparse data is supported. It will be interesting to see new benchmark results. Is anyone familiar with BIDMach? Is it as fast as they claim? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
Are you using 1.0 or current master? A bug related to this has been fixed in master. On Jul 12, 2014 8:50 AM, Srikrishna S srikrishna...@gmail.com wrote: I am running logistic regression with SGD on a problem with about 19M parameters (the kdda dataset from the libsvm library). I consistently see that

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
https://issues.apache.org/jira/browse/SPARK-2156 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S srikrishna...@gmail.com wrote: I am using

Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from the latest development branch of the git repository. On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: By latest branch you mean Apache Spark 1.0.0? And what do you mean by master? Because I am using v1.0.0 - Alex -- View this message in

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
not be as straightforward as just changing the version in the Spark build script. Jetty 9.x requires Java 7 because the Servlet API it uses (Servlet 3.1) requires Java 7. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the mode that requires installing the jar on each individual node is standalone mode, which works with both MR1 and MR2. Cloudera and Hortonworks currently support Spark in this way as far as I know. For both yarn-cluster and yarn-client, Spark will distribute the jars through distributed cache

Re: usage question for saprk run on YARN

2014-07-07 Thread DB Tsai
yarn-client mode runs the driver in your application's JVM, while yarn-cluster mode runs the driver inside the YARN cluster. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jul 7, 2014 at 5:44 PM

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-05 Thread DB Tsai
You may try LBFGS for more stable convergence. In Spark 1.1, we will be able to use LBFGS instead of GD in the training process. On Jul 4, 2014 1:23 PM, Thomas Robert tho...@creativedata.fr wrote: Hi all, I too am having some issues with *RegressionWithSGD algorithms. Concerning your issue

Re: pyspark regression results way off

2014-06-25 Thread DB Tsai
There is no python binding for LBFGS. Feel free to submit a PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi mohitja...@gmail.com wrote

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
)) } System.setProperty("SPARK_YARN_MODE", "true"); val sparkConf = new SparkConf(); val args = getArgsFromConf(conf); new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run() Sincerely, DB Tsai --- My Blog: https

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
, etc. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers ko...@tresata.com wrote: db tsai, if in yarn-cluster mode the driver runs inside yarn, how

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, What's the difference between treeAggregate and aggregate? Why does treeAggregate scale better? If we just use mapPartitions, will it be as fast as treeAggregate? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, Does that mean mapPartitions followed by reduce behaves the same as the aggregate operation, which is O(n)? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jun 17

Re: pyspark regression results way off

2014-06-16 Thread DB Tsai
Is your data normalized? Sometimes GD doesn't work well if the data has a wide range. If you are willing to write Scala code, you can try the LBFGS optimizer, which converges better than GD. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread DB Tsai
Hi Congrui, We're working on weighted regularization, so for the intercept, you can just set it to 0. It's also useful when the data is normalized but you want to solve the regularization with the original data. Sincerely, DB Tsai --- My Blog: https

Re: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread DB Tsai
Hi Congrui, I mean create your own TrainMLOR.scala with all the code provided in the example, and place it under the package org.apache.spark.mllib. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread DB Tsai
Hi Congrui, Since it's private to the mllib package, one workaround is to write your code in a Scala file under the mllib package so that you can use the constructor of LogisticRegressionModel. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Normalizations in MLBase

2014-06-12 Thread DB Tsai
at 11:13 AM, Aslan Bekirov aslanbeki...@gmail.com wrote: Thanks a lot DB. I will try to do Znorm normalization using map transformation. BR, Aslan On Thu, Jun 12, 2014 at 12:16 AM, DB Tsai dbt...@stanford.edu wrote: Hi Aslan, Currently, we don't have the utility function to do so

Re: Normalizations in MLBase

2014-06-11 Thread DB Tsai
Hi Aslan, Currently, we don't have a utility function to do so. However, you can easily implement this with another map transformation. I'm working on this feature now, and there will be a couple of different normalization options users can choose from. Sincerely, DB Tsai
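The "another map transformation" idea can be sketched without Spark. The snippet below is a minimal illustration, assuming z-score normalization (as mentioned in the follow-up message) with the population standard deviation and a non-empty input; `ZNorm`, `stats`, and `normalize` are hypothetical names, and a plain `Seq` stands in for an RDD.

```scala
// Minimal sketch, not the MLlib implementation: z-score normalization
// written as "compute column statistics once, then map over the rows".
object ZNorm {
  // Column-wise mean and population standard deviation.
  // Assumes `rows` is non-empty and all rows have the same length.
  def stats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val n = rows.size.toDouble
    val dim = rows.head.length
    val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
    val std = Array.tabulate(dim) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
    }
    (mean, std)
  }

  // The map step: rescale every row with the precomputed statistics.
  // Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
  def normalize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val (mean, std) = stats(rows)
    rows.map { r =>
      Array.tabulate(r.length) { j =>
        if (std(j) == 0.0) 0.0 else (r(j) - mean(j)) / std(j)
      }
    }
  }
}
```

On an RDD the same shape would presumably apply: one pass to collect the statistics, then a `map` that rescales each record.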

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-10 Thread DB Tsai
Hi Nick, How does reduce work? I thought that after reducing in the executor, it would reduce in parallel between multiple executors instead of pulling everything to the driver and reducing there. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
tracker for each operation will be very expensive. Is there a way to disable this behavior? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
What if there are multiple threads using the same Spark context, will each thread have its own UI? In that case, it will quickly run out of ports. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan, You can check out the unit test code for GradientDescent.runMiniBatchSGD: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala Sincerely, DB Tsai --- My Blog

Re: Logistic Regression MLLib Slow

2014-06-05 Thread DB Tsai
/latest/mllib-optimization.html for detail. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng men...@gmail.com wrote: Hi Krishna, Specifying executor

Re: Logistic Regression MLLib Slow

2014-06-05 Thread DB Tsai
Hi Krishna, It should work, and we use it in production with great success. However, the constructor of LogisticRegressionModel is private[mllib], so you have to write your own code in a file whose package is under org.apache.spark.mllib instead of using the Scala console. Sincerely, DB Tsai
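The effect of a `private[mllib]` constructor can be illustrated with Scala's qualified access modifiers. This is a minimal sketch only: an enclosing object named `mllibLike` stands in for the `org.apache.spark.mllib` package, and `Model` and `Workaround` are hypothetical names, not MLlib classes.

```scala
// Minimal sketch of qualified-private access. `mllibLike` plays the role of
// the mllib package; everything here is illustrative, not Spark code.
object mllibLike {
  // The constructor is callable only from code inside `mllibLike`,
  // mirroring a constructor declared private[mllib].
  class Model private[mllibLike] (val weights: Vector[Double], val intercept: Double)

  // Code that lives inside the same scope may call the restricted constructor.
  object Workaround {
    def build(weights: Vector[Double], intercept: Double): Model =
      new Model(weights, intercept)
  }
}

// From outside the scope, the following line would be rejected by the compiler:
//   new mllibLike.Model(Vector(0.5), 0.0)
```

Putting your own file under the restricted package, as suggested in the message, gives your code the same position that `Workaround` has in this sketch.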

Re: Passing runtime config to workers?

2014-05-18 Thread DB Tsai
not serializable, it will raise an exception. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 18, 2014 at 12:58 PM, Robert James srobertja...@gmail.com wrote: I see - I didn't realize that scope

Calling external classes added by sc.addJar needs to be through reflection

2014-05-16 Thread DB Tsai
will not be seen. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
in JVM. Users can use the classes directly. https://github.com/dbtsai/classloader-experiement/blob/master/calling/src/main/java/Calling3.java I'm now porting example 3) to Spark, and will let you know if it works. Thanks. Sincerely, DB Tsai

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
The jars are actually there (and in the classpath), but you need to load them through reflection. I have another thread that gives the workaround. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri
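Loading a class through reflection generally looks like the sketch below. It is an illustration under assumptions: that the thread's context class loader is the one that can see the added jars (the message does not say which loader is involved), and that the class has a public no-argument constructor. `ReflectiveLoad`, `newInstance`, and `call` are hypothetical names.

```scala
// Minimal sketch of reflective loading; not the workaround from the thread.
object ReflectiveLoad {
  // Look the class up by name through the context class loader (an assumption)
  // and instantiate it with its no-argument constructor.
  def newInstance(className: String): AnyRef = {
    val loader = Thread.currentThread().getContextClassLoader
    val clazz = Class.forName(className, true, loader)
    clazz.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
  }

  // Invoke a public no-argument method by name on the loaded object.
  def call(target: AnyRef, methodName: String): AnyRef =
    target.getClass.getMethod(methodName).invoke(target)
}
```

For example, `ReflectiveLoad.call(ReflectiveLoad.newInstance("java.util.ArrayList"), "size")` creates a list and calls `size` on it without referring to the class at compile time.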

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread DB Tsai
classNotFound exception. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 14, 2014 at 6:04 PM, Xiangrui Meng men...@gmail.com wrote: In SparkContext#addJar, for yarn

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread DB Tsai
honor it. I'm trying to figure out the problem now. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 14, 2014 at 5:46 AM, wxhsdp wxh...@gmail.com wrote: Hi, DB i've add breeze

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread DB Tsai
tomorrow. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13, 2014 at 11:41 PM, Xiangrui Meng men...@gmail.com wrote: I don't know whether this would fix the problem

Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
the loop have to be serializable. Since the for-loop is serializable in Scala, I guess you have something non-serializable inside the for-loop. The while-loop in Scala is native, so you won't have this issue if you use a while-loop. Sincerely, DB Tsai
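One way to read the for-loop versus while-loop remark: the Scala compiler rewrites `for (i <- range) body` into `range.foreach(i => body)`, so the body becomes a function object that captures whatever it references, while a `while` loop compiles to a plain jump with no function object. The sketch below shows the three forms side by side; `LoopForms` and its method names are hypothetical.

```scala
// Minimal sketch: three equivalent loops. Only the first two create a
// function object for the loop body.
object LoopForms {
  // Written with `for`; the compiler rewrites this to the foreach form below.
  def sumWithFor(n: Int): Int = {
    var acc = 0
    for (i <- 0 until n) acc += i
    acc
  }

  // The desugared form: the body is a closure passed to foreach.
  def sumWithForeach(n: Int): Int = {
    var acc = 0
    (0 until n).foreach(i => acc += i)
    acc
  }

  // A while loop has no closure, so nothing extra is captured.
  def sumWithWhile(n: Int): Int = {
    var acc = 0
    var i = 0
    while (i < n) { acc += i; i += 1 }
    acc
  }
}
```

All three return the same value; the difference that matters for serialization is only what objects the loop body ends up holding references to.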

Re: Turn BLAS on MacOSX

2014-05-13 Thread DB Tsai
Hi wxhsdp, See https://github.com/scalanlp/breeze/issues/142 and https://github.com/fommil/netlib-java/issues/60 for details. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13

Re: Spark LIBLINEAR

2014-05-12 Thread DB Tsai
It seems that the code isn't managed on GitHub. It can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip It would be easier to track the changes on GitHub. Sincerely, DB Tsai

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-05 Thread DB Tsai
jar api like Yadid said. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 4, 2014 at 10:24 PM, wxhsdp wxh...@gmail.com wrote: Hi, DB, i think it's something related to sbt

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread DB Tsai
breeze jar in the Spark flat assembly jar, so you don't need to add the breeze dependency yourself. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 4, 2014 at 4:07 AM, wxhsdp wxh

Re: string to int conversion

2014-05-02 Thread DB Tsai
You can drop the header in a csv by rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => { if (partitionIdx == 0) lines.drop(1) else lines }) On May 2, 2014 6:02 PM, SK skrishna...@gmail.com wrote: 1) I have a csv file where one of the fields has integer data but it
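The header-dropping function can be tried without a cluster. In this minimal sketch, a `Seq` of partitions stands in for the RDD; `DropHeader` and `run` are hypothetical names, while `dropHeader` has the `(Int, Iterator[String]) => Iterator[String]` shape that `mapPartitionsWithIndex` expects. Note that the result of `lines.drop(1)` is returned rather than discarded, since an iterator should not be reused after `drop` is called on it.

```scala
// Minimal sketch: drop the first line of partition 0 only.
object DropHeader {
  // Same shape as the function passed to mapPartitionsWithIndex.
  def dropHeader(partitionIdx: Int, lines: Iterator[String]): Iterator[String] =
    if (partitionIdx == 0) lines.drop(1) else lines

  // Stand-in for an RDD (hypothetical helper): apply the function to each
  // partition together with its index and concatenate the results.
  def run(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      dropHeader(idx, part.iterator).toList
    }
}
```

With Spark on the classpath, `rddData.mapPartitionsWithIndex(DropHeader.dropHeader _)` should apply the same function to a real RDD of strings.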

Re: Running out of memory Naive Bayes

2014-04-28 Thread DB Tsai
A year ago, our customer asked us to implement Naive Bayes that could at least train on news20, and we implemented it for them in Hadoop, using the distributed cache to store the model. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread DB Tsai
Your code is unformatted. Can you paste the whole file in a gist so I can take a look? On Apr 28, 2014 10:42 PM, Earthson earthson...@gmail.com wrote: I've moved SparkContext and RDD as parameters of train. And now it tells me that SparkContext needs to be serialized! I think the problem is

Re: is it okay to reuse objects across RDD's?

2014-04-27 Thread DB Tsai
tolerance and data splitting to disk. It would be nice to have an API with native support for this type of book-keeping. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Apr

Re: Running out of memory Naive Bayes

2014-04-26 Thread DB Tsai
Which version of MLlib are you using? In Spark 1.0, MLlib will support sparse feature vectors, which will improve performance a lot when computing the distance between points and centroids. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: skip lines in spark

2014-04-23 Thread DB Tsai
What I suggested will not work if the number of records you want to drop is larger than the number of records in the first partition. In my use case, I only drop the first couple of lines, so I don't have this issue. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com
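The limitation described here is easy to reproduce on a small example. The sketch below uses a `Seq` of partitions as a stand-in for an RDD and compares trimming only the first partition with an alternative that numbers records globally before filtering. The alternative is an assumption added for illustration, not something proposed in the message; `DropFirstN` and its method names are hypothetical.

```scala
// Minimal sketch: dropping the first n records when n may exceed the size
// of the first partition.
object DropFirstN {
  // Per-partition approach: only partition 0 is trimmed, so records that
  // should be dropped but live in later partitions survive.
  def dropInFirstPartition(partitions: Seq[Seq[String]], n: Int): Seq[String] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      if (idx == 0) part.drop(n) else part
    }

  // Alternative (assumption): give every record a global index,
  // then keep only records whose index is at least n.
  def dropByGlobalIndex(partitions: Seq[Seq[String]], n: Int): Seq[String] =
    partitions.flatten.zipWithIndex.collect { case (line, i) if i >= n => line }
}
```

With partitions `Seq(Seq("h1"), Seq("h2", "a"), Seq("b"))` and `n = 2`, the first method still returns `h2` because the first partition holds only one record, while the second returns just `a` and `b`. On a real RDD, a global index from a method such as `zipWithIndex`, where available, could play the same role at the cost of an extra pass.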

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
with Spark in SF Machine Learning Meetup, please let me know. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong, This is pretty much what I did. I ran into the same issue you have. Since I'm not developing YARN-related stuff, I just excluded those two YARN-related projects from IntelliJ, and it works. PS, you may need to exclude the java8 project as well now. Sincerely, DB Tsai

Re: Hadoop LR comparison

2014-04-01 Thread DB Tsai
soon. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, Is the code available for Hadoop to calculate the Logistic Regression hyperplane
