Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread DB Tsai
Thank you, Huaxin for the 3.2.1 release! Sent from my iPhone > On Jan 28, 2022, at 5:45 PM, Chao Sun wrote: > >  > Thanks Huaxin for driving the release! > >> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng wrote: >> It's Great! >> Congrats and thanks, huaxin! >> >> >> --

Re: Spark 3.0.1 not connecting with Hive 2.1.1

2021-01-09 Thread DB Tsai
Hi Pradyumn, I think it’s because of a HMS client backward compatibility issue described here, https://issues.apache.org/jira/browse/HIVE-24608 Thanks, DB Tsai | ACI Spark Core |  Apple, Inc > On Jan 9, 2021, at 9:53 AM, Pradyumn Agrawal wrote: > > Hi Michael, > Thanks fo

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
S/Hive running on Hadoop 2.6 ? > > Best Regards, -- Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 42E5B25A8F7A82C1 - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
teHadoopClasspath" is used in YARN mode correct? > However our Spark cluster is standalone cluster not using YARN. > We only connect to HDFS/Hive to access data.Computation is done on our spark > cluster running on K8s (not Yarn) > > > On Mon, Jul 20, 2020 at 2:04 PM DB Tsa

Re: JDK11 Support in Apache Spark

2019-08-24 Thread DB Tsai
Congratulations on the great work! Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 42E5B25A8F7A82C1 On Sat, Aug 24, 2019 at 8:11 AM Dongjoon Hyun wrote: > > Hi, All. > > Thanks to your many many contributions, &g

Re: Release Apache Spark 2.4.4

2019-08-13 Thread DB Tsai
+1 On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun wrote: > > Hi, All. > > Spark 2.4.3 was released three months ago (8th May). > As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` > since 2.4.3. > > It would be great if we can have Spark 2.4.4. > Shall we start `2.4.4

[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-04 Thread DB Tsai
+user list We are happy to announce the availability of Spark 2.4.1! Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4.0 users to upgrade to this stable release. In Apache Spark 2.4.1, Scala 2.12 support is GA, and

Re: imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread DB Tsai
We have the weighting algorithms implemented in linear models, but unfortunately, they are not implemented in tree models. It's an important feature, and a PR is welcome! Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID

Re: Why dataframe can be more efficient than dataset?

2017-04-13 Thread DB Tsai
There is a JIRA and a prototype that analyzes the JVM bytecode in the black box and converts the closures into Catalyst expressions. https://issues.apache.org/jira/browse/SPARK-14083 This can potentially address the issue discussed here. Sincerely, DB Tsai

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread DB Tsai
Hi Jong, I think the definition from Kaggle is correct. I'm working on implementing ranking metrics in Spark ML now, but the timeline is unknown. Feel free to submit a PR for this in MLlib. Thanks. Sincerely, DB Tsai -- Web: https

Re: SPARK ML- Feature Selection Techniques

2016-09-06 Thread DB Tsai
You can try LOR with L1. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Sep 5, 2016 at 5:31 AM, Bahubali Jain <bahub...@gmail.com> wrote: > Hi, > Do we have any feature selection techniques im

Re: a question about LBFGS in Spark

2016-08-24 Thread DB Tsai
the regularization part of gradient. // Will add the gradientSum computed from the data with weights in the next step. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D >> On Wed, Aug 24, 2016 at 7:16 AM Lingling Li

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread DB Tsai
+1 for renaming the jar file. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly <ch...@fregly.com> wrote: > perhaps renaming to Spark ML would actually clea

Re: Incomplete data when reading from S3

2016-03-19 Thread DB Tsai
You need to use wholeTextFiles to read the whole file at once. Otherwise, it can be split. DB Tsai - Sent From My Phone On Mar 17, 2016 12:45 AM, "Blaž Šnuderl" <snud...@gmail.com> wrote: > Hi. > > We have json data stored in S3 (json record per line). When reading

Re: the way to compare any two adjacent elements in one rdd

2015-12-06 Thread DB Tsai
Only beginning and ending part of data. The rest in the partition can be compared without shuffle. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Dec 6, 2015 at 6:27 PM, Zhiliang Zhu <zchl.j...@yahoo.

Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread DB Tsai
This is tricky. You need to shuffle the ending and beginning elements using mapPartitionsWithIndex. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu <zchl.j...@yahoo.
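The boundary handling described in this thread can be sketched in plain Scala. This is only an illustration under assumptions: partitions are modelled as ordinary sequences rather than an RDD, and the names `AdjacentPairsSketch` and `adjacentPairs` are hypothetical. In Spark, the per-partition step would run inside mapPartitionsWithIndex, and only the first element of each partition would need to move between partitions.

```scala
object AdjacentPairsSketch {
  // Partitions are modelled as plain sequences (an assumption for illustration);
  // on an RDD the same two steps would run inside mapPartitionsWithIndex.
  def adjacentPairs[T](partitions: Seq[Seq[T]]): Seq[(T, T)] = {
    // Step 1: the only data that crosses partition boundaries is the first
    // element of each non-empty partition, keyed by partition index.
    val heads: Seq[(Int, T)] = partitions.zipWithIndex.collect {
      case (part, idx) if part.nonEmpty => (idx, part.head)
    }
    // Step 2: each partition pairs up its own neighbours locally, after
    // appending the head of the next non-empty partition (if there is one).
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      val nextHead = heads.collectFirst { case (i, h) if i > idx => h }
      val extended = if (part.isEmpty) part else part ++ nextHead.toList
      extended.zip(extended.drop(1))
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5), Seq(6))
    adjacentPairs(partitions).foreach(println)
  }
}
```

Everything inside a partition is compared locally; the sketch only looks across a boundary to fetch one element, which matches the point that the rest of the partition needs no shuffle.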

Re: Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread DB Tsai
://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Nov 17, 2015 at 4:11 PM, njoshi

Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread DB Tsai
to be small enough to return the result to users within reasonable latency, so I doubt the usefulness of distributed models in real production use cases. For R and Python, we can build a wrapper on top of the lightweight "spark-ml-common" project. Sincerely

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
This will bring in all the dependencies of Spark, which may break the web app. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote: > &

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
nity, we need to address this. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen <so...@cloudera.com> wrote: > This is all starting to sound a lot like what's alread

Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread DB Tsai
Do you think it will be useful to separate those models and model loader/writer code into another spark-ml-common jar without any spark platform dependencies so users can load the models trained by Spark ML in their application and run the prediction? Sincerely, DB Tsai

Re: [Spark MLlib] about linear regression issue

2015-11-01 Thread DB Tsai
ear regression, but currently, there is no open source implementation in Spark. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Nov 1, 2015 at 9:22 AM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote: > Dear All,

Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
shrinkage). Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu <rotationsymmetr...@gmail.com> wrote: > Hi DB Tsai, > > Thank you very much fo

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do you think you can implement generic GBM and have it merged as part of Spark codebase? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical feature? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <dbt...@dbtsai.com> wrote: > Interesting. For feature sub-sampling,

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
Column 4 is always constant, so it has no predictive power, resulting in a zero weight. On Sunday, October 25, 2015, Zhiliang Zhu <zchl.j...@yahoo.com> wrote: > Hi DB Tsai, > > Thanks very much for your kind reply help. > > As for your comment, I just modified and tested the

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
LinearRegressionWithSGD is not stable. Please use linear regression in ML package instead. http://spark.apache.org/docs/latest/ml-linear-methods.html Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Oct 25

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread DB Tsai
those code to share more.) Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D> On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu <javeli...@gmail.com>

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread DB Tsai
Try to run to see actual ulimit. We found that mesos overrides the ulimit which causes the issue. import sys.process._ val p = 1 to 100 val rdd = sc.parallelize(p, 100) val a = rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
You want to reduce the # of partitions to around the # of executors * cores, since having so many tasks/partitions puts a lot of pressure on treeReduce in LoR. Let me know if this helps. Sincerely, DB Tsai -- Blog: https

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Could you paste some of your code for diagnosis? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zh

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Your code looks correct to me. How many features do you have in this training? How many tasks are running in the job? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?sea

Re: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
in ./apps/mesos-0.22.1/sbin/mesos-daemon.sh #!/usr/bin/env bash prefix=/apps/mesos-0.22.1 exec_prefix=/apps/mesos-0.22.1 deploy_dir=${prefix}/etc/mesos # Increase the default number of open file descriptors. ulimit -n 8192 Sincerely, DB Tsai

Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
= rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect Hope this can help someone in the same situation. Sincerely, DB Tsai -- Blog: https://www.

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
Please see the current version of code for better documentation. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
don't see it explicitly, but the code is in line 128. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Jun 23, 2015 at 3:14 PM, Wei Zhou zhweisop...@gmail.com wrote: Hi DB Tsai, Thanks for your reply. I went

Re: Missing values support in Mllib yet?

2015-06-19 Thread DB Tsai
Not really yet. But at work, we do GBDT missing values imputation, so I've the interest to port them to mllib if I have enough time. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Jun 19, 2015 at 1:23 PM

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-19 Thread DB Tsai
. Here is the talk I gave at Spark Summit about the new elastic-net feature in ML. I would encourage you to try the one in ML. http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit Sincerely, DB Tsai

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
You need to build the spark assembly with your modification and deploy into cluster. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Jun 17, 2015 at 5:11 PM, Raghav Shankar raghav0110...@gmail.com wrote

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
all of them. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Jun 17, 2015 at 5:15 PM, Raghav Shankar raghav0110...@gmail.com wrote: So, I would add the assembly jar to the just the master or would I have

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
as you see necessary. Thanks, Sauptik. -Original Message- From: DB Tsai Sent: Tuesday, June 16, 2015 2:08 PM To: Ramakrishnan Naveen (CR/RTC1.3-NA) Cc: Dhar Sauptik (CR/RTC1.3-NA) Subject: Re: FW: MLLIB (Spark) Question. Hey, In the LORWithLBFGS api you use, the intercept

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar, For standardization, we can disable it effectively by using different regularization on each component. Thus, we're solving the same problem but having better rate of convergence. This is one of the features I will implement. Sincerely, DB Tsai

Re: Implementing top() using treeReduce()

2015-06-09 Thread DB Tsai
} }.toArray.sorted(ord) } } } def treeTop(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { treeTakeOrdered(num)(ord.reverse) } Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D https://pgp.mit.edu

Re: Linear Regression with SGD

2015-06-09 Thread DB Tsai
As Robin suggested, you may try the following new implementation. https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef Thanks. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D https

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
? Thanks! On Thursday, June 4, 2015, DB Tsai dbt...@dbtsai.com wrote: By default, the depth of the tree is 2. Each partition will be one node. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Thu, Jun 4, 2015 at 10:46 AM

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
By default, the depth of the tree is 2. Each partition will be one node. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar raghav0110...@gmail.com wrote: Hey Reza, Thanks for your response
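To make the tree shape concrete, here is a rough plain-Scala sketch of a level-by-level reduction over per-partition partial results. It is not Spark's actual treeReduce code: the `fanIn` parameter and the grouping rule are assumptions chosen for illustration, and the names `TreeReduceSketch` and `treeReduce` are hypothetical.

```scala
object TreeReduceSketch {
  // Combine per-partition partial results level by level instead of sending
  // them all to one place at once. `fanIn` (a hypothetical parameter) is the
  // number of nodes merged into one parent at each level.
  def treeReduce[T](partials: Seq[T], fanIn: Int)(f: (T, T) => T): T = {
    require(partials.nonEmpty, "need at least one partial result")
    require(fanIn >= 2, "fanIn must be at least 2")
    var level: Seq[T] = partials
    while (level.size > 1) {
      // Each group of `fanIn` neighbours is reduced to a single parent node.
      level = level.grouped(fanIn).map(_.reduce(f)).toVector
    }
    level.head
  }

  def main(args: Array[String]): Unit = {
    // Nine leaves (one per partition) merged three at a time: 9 -> 3 -> 1.
    val perPartitionSums = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9)
    println(treeReduce(perPartitionSums, fanIn = 3)(_ + _))
  }
}
```

With nine leaves and a fan-in of three, the partial results are merged in two rounds, so no single node has to combine all nine at once.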

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the # of executors you have. In Spark 1.4's new ML pipeline

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
. On Jun 3, 2015, at 9:53 PM, DB Tsai dbt...@dbtsai.com wrote: Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle. I guess you don't have enough partitions, so please

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-31 Thread DB Tsai
of interest? -- Weights are calculated as with all logistic regression algorithms, by using convex optimization to minimize a regularized log loss. Good luck! Joseph On Fri, May 22, 2015 at 1:07 PM, DB Tsai dbt...@dbtsai.com wrote: In Spark 1.4, Logistic Regression with elasticNet

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread DB Tsai
the result from R. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, May 27, 2015 at 9:08 PM, Maheshakya Wijewardena mahesha...@wso2.com wrote: Hi, I'm trying to use Sparks' LinearRegressionWithSGD in PySpark with the attached dataset

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
If with mesos, how do we control the number of executors? In our cluster, each node only has one executor with very big JVM. Sometimes, if the executor dies, all the concurrent running tasks will be gone. We would like to have multiple executors in one node but can not figure out a way to do it in

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
Typo. We can not figure a way to increase the number of executor in one node in mesos. On Wednesday, May 27, 2015, DB Tsai dbt...@dbtsai.com wrote: If with mesos, how do we control the number of executors? In our cluster, each node only has one executor with very big JVM. Sometimes

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread DB Tsai
Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Fri, May 22, 2015 at 10:45 AM, Xin Liu liuxin...@gmail.com wrote: Thank you guys for the prompt help. I ended up building spark master and verified what DB has suggested. val lr = (new

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread DB Tsai
In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML pipeline framework. Model selection can be achieved through a high lambda, which results in lots of zeros in the coefficients. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread DB Tsai
Hi Xin, If you take a look at the model you trained, the intercept from Spark is significantly smaller than StatsModel, and the intercept represents a prior on categories in LOR which causes the low accuracy in Spark implementation. In LogisticRegressionWithLBFGS, the intercept is regularized due

Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote: Hi all, I'm looking to implement a Multilabel

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread DB Tsai
the scaling and intercepts implicitly in objective function so no overhead of creating new transformed dataset. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Apr 29, 2015 at 1:21 AM, selim namsi selim.na...@gmail.com wrote: Thank

Re: Features scaling

2015-04-21 Thread DB Tsai
Hi Denys, I don't see any issue in your Python code, so maybe there is a bug in the Python wrapper. If it's in Scala, I think it should work. BTW, LogisticRegressionWithLBFGS does the standardization internally, so you don't need to do it yourself. It's worth giving it a try! Sincerely, DB Tsai

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
it will cause problem for the algorithm. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote: Hello, I am new to spark streaming API. I wanted to ask if I can apply LBFGS

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-25 Thread DB Tsai
We fixed a couple of issues in the Breeze LBFGS implementation. Can you try Spark 1.3 and see if they still exist? Thanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 12:48 PM, Chang-Jia Wang c...@cjwang.us wrote: I

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread DB Tsai
Are you deploying the windows dll to linux machine? Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen davidshe...@gmail.com wrote: I think you meant to use the --files to deploy the DLLs. I gave

Re: How to deploy binary dependencies to workers?

2015-03-24 Thread DB Tsai
I would recommend uploading those jars to HDFS and using the add jars option in spark-submit with an HDFS URI instead of a local filesystem URI. This avoids fetching the jars from the driver, which can be a bottleneck. Sincerely, DB Tsai

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Fri, Mar 13, 2015 at 2:41 PM, cjwang c...@cjwang.us wrote: I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR

Re: LBGFS optimizer performace

2015-03-05 Thread DB Tsai
PS, I would recommend compressing the data when you cache the RDD. There will be some overhead from compression/decompression and serialization/deserialization, but it helps a lot for iterative algorithms because more data can be cached. Sincerely, DB Tsai

Re: foreachActive functionality

2015-01-25 Thread DB Tsai
PS, we were originally using Breeze's activeIterator, as you can see in the old code, but we found it has overhead, so we wrote our own implementation, which is 4x faster. See https://github.com/apache/spark/pull/3288 for details. Sincerely, DB Tsai

Re: Effects problems in logistic regression

2014-12-22 Thread DB Tsai
Sounds great. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: Thanks again DB Tsai

Re: ERROR YarnClientClusterScheduler: Lost executor Akka client disassociated

2014-12-15 Thread DB Tsai
want to break down which part of your code causes the issue to make debugging easier. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Dec 11, 2014 at 4:48 AM, Muhammad Ahsan muhammad.ah

Re: Including data nucleus tools

2014-12-15 Thread DB Tsai
Just out of curiosity, did you manually apply this patch to see if it actually resolves the issue? It seems that it was merged at some point but reverted because it caused some stability issues. Sincerely, DB Tsai --- My Blog: https

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
You need to apply StandardScaler yourself to help convergence. LBFGS just takes whatever objective function you provide without doing any scaling. I would like to provide LinearRegressionWithLBFGS, which does the scaling internally, in the near future. Sincerely, DB Tsai

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
the coefficients to the original space from the scaled space, the intercept can be computed by w0 = y - \sum x_n w_n where x_n is the average of column n. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
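For reference, the intercept formula in this message can be written out in full. This is a sketch under assumptions the snippet does not state: the model is taken to be fit without an intercept on columns that were centered and scaled, the response is taken to be standardized the same way, and the symbols \bar{x}_n, \bar{y} (training averages), \sigma_n, \sigma_y (standard deviations) and w'_n (coefficients in the scaled space) are introduced here for illustration.

```latex
\begin{align}
\frac{y - \bar{y}}{\sigma_y} &= \sum_n w'_n \, \frac{x_n - \bar{x}_n}{\sigma_n}
  && \text{(assumed model in the scaled space)} \\
w_n &= \frac{\sigma_y}{\sigma_n} \, w'_n
  && \text{(coefficients in the original space)} \\
w_0 &= \bar{y} - \sum_n \bar{x}_n \, w_n
  && \text{(intercept, as in the message)}
\end{align}
```

Multiplying the first line by \sigma_y and collecting the constant terms gives y = w_0 + \sum_n w_n x_n, which is where the last two lines come from.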

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
= scalerWithResponse.transform(rddVector).map(x => { (x(x.size - 1), Vectors.dense(x.toArray.slice(0, x.size - 1))) }) Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 12, 2014 at 12:23

Re: Why KMeans with mllib is so slow ?

2014-12-08 Thread DB Tsai
You just need to use the latest master code without any configuration to get performance improvement from my PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 8, 2014 at 7:53

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
Also, are you using the latest master in this experiment? A PR merged into master a couple of days ago will speed up k-means by three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai

Re: Including data nucleus tools

2014-12-05 Thread DB Tsai
Can you try to run the same job using the assembly packaged by make-distribution as we discussed in the other thread. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
I also worry that the author of JPMML changed the license of jpmml-evaluator in the interest of his commercial business, and that he might change the license of jpmml-model in the future. Sincerely, DB Tsai --- My Blog: https

Re: embedded spark for unit testing..

2014-11-09 Thread DB Tsai
/apache/spark/mllib/util/LocalSparkContext.scala as an example. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Nov 9, 2014 at 9:12 PM, Kevin Burton bur...@spinn3r.com wrote: What’s

Re: Shuffle issues in the current master

2014-10-25 Thread DB Tsai
Hi Andrew, We were running the master after SPARK-3613. We will give it another shot against the current master, now that Josh has fixed a couple of issues in shuffle. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
We don't have SVMWithLBFGS, but you can check out how we implement LogisticRegressionWithLBFGS, and we also deal with some condition number improving stuff in LogisticRegressionWithLBFGS which improves the performance dramatically. Sincerely, DB Tsai

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
oh, we just train the model in the standardized space which will help the convergence of LBFGS. Then we convert the weights to original space so the whole thing is transparent to users. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com
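The conversion back to the original space can be sketched as below. The assumption here, which the message does not spell out, is that each feature column is only divided by its standard deviation (no centering), so a weight learned on the scaled column is divided by the same standard deviation. The names `UnscaleWeightsSketch` and `toOriginalSpace` are hypothetical and not part of MLlib.

```scala
object UnscaleWeightsSketch {
  // Assumes each feature was scaled as x'_j = x_j / std_j before training, so
  // the margin sum(w'_j * x'_j) equals sum((w'_j / std_j) * x_j).
  def toOriginalSpace(scaledWeights: Array[Double], std: Array[Double]): Array[Double] = {
    require(scaledWeights.length == std.length, "one standard deviation per weight")
    scaledWeights.zip(std).map { case (w, s) =>
      // A constant column (std == 0) carries no information; give it weight 0.
      if (s != 0.0) w / s else 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    val original = toOriginalSpace(Array(2.0, 3.0, 1.0), Array(2.0, 0.5, 0.0))
    println(original.mkString(", "))
  }
}
```

Because the conversion happens once after training, callers only ever see weights that apply directly to their unscaled features.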

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
Yeah, column normalization. For some of the datasets, it will not converge without this. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Oct 24, 2014 at 3:46 PM, Debasish

Shuffle issues in the current master

2014-10-22 Thread DB Tsai
) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Sincerely, DB Tsai

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
It seems that this issue should be addressed by https://github.com/apache/spark/pull/2890 ? Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Oct 22, 2014 at 11:54 AM, DB

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
Or can it be solved by setting both of the following settings to true for now? spark.shuffle.spill.compress true spark.shuffle.compress true Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
PS, sorry for spamming the mailing list. Based on my knowledge, both spark.shuffle.spill.compress and spark.shuffle.compress default to true, so in theory, we should not run into this issue if we don't change any setting. Is there any other bug we are running into? Thanks. Sincerely, DB Tsai

Re: why fetch failed

2014-10-20 Thread DB Tsai
here https://github.com/cloudera/spark/tree/cdh5-1.1.0_5.2.0 PS, I haven't tested it yet, but I will test it in the next couple of days and report back. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: How to emit multiple keys for the same value?

2014-10-20 Thread DB Tsai
You can do this using flatMap with a function that returns a Seq of (key, value) pairs. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Oct 20, 2014 at 9:31 AM, HARIPRIYA AYYALASOMAYAJULA aharipriy
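A small sketch of the flatMap approach, written against plain Scala collections so it runs without a cluster. The record shape (a list of keys plus one value) and the names `MultiKeySketch` and `emitPerKey` are assumptions for illustration; flatMap on an RDD takes a function of the same shape.

```scala
object MultiKeySketch {
  // Each input record carries one value and every key it should be emitted under.
  // flatMap flattens the per-record Seq of pairs into one collection of pairs.
  def emitPerKey[K, V](records: Seq[(Seq[K], V)]): Seq[(K, V)] =
    records.flatMap { case (keys, value) => keys.map(k => (k, value)) }

  def main(args: Array[String]): Unit = {
    val records = Seq((Seq("a", "b"), 1), (Seq("c"), 2))
    emitPerKey(records).foreach(println)
  }
}
```

A record with no keys simply contributes nothing to the output, which is the usual reason to prefer flatMap over map followed by a separate flatten step.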

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread DB Tsai
I saw similar bottleneck in reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on single executor reducing the particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread DB Tsai
- 1 until rdds.length) { temp = temp.unionAll(rdds(i)) } temp } Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Oct 13, 2014 at 7:22 PM, Nicholas Chammas nicholas.cham

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-10-09 Thread DB Tsai
, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Sep 29, 2014 at 11:45 AM, Yanbo Liang yanboha...@gmail.com wrote: Thank you for all your patient response. I can conclude that if the data

Re: Fwd: Breeze Library usage in Spark

2014-10-03 Thread DB Tsai
You don't have to include the Breeze jar, which is already in the Spark assembly jar. The native one is optional. Sent from my Google Nexus 5 On Oct 3, 2014 8:04 PM, Priya Ch learnings.chitt...@gmail.com wrote: yes. I have included breeze-0.9 in build.sbt file. I ll change this to 0.7. Apart from

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread DB Tsai
by multiply a constant to the weights. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread DB Tsai
. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Sep 13, 2014 at 2:12 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi All, I found that LogisticRegressionWithLBFGS interface

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-06 Thread DB Tsai
Yes. But you need to store the RDD as *serialized* Java objects. See the section on storage levels http://spark.apache.org/docs/latest/programming-guide.html Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-04 Thread DB Tsai
To save memory, I recommend compressing the cached RDD; it will be a couple of times smaller than the original data set. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread DB Tsai
we have internal version requiring some cleanup for open source project. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3, 2014 at 7:34 PM, Xiangrui Meng men...@gmail.com wrote

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Cui You can take a look at multinomial logistic regression PR I created. https://github.com/apache/spark/pull/1379 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Debasish, I didn't try one-vs-all vs softmax regression. One issue is that for one-vs-all, we have to train k classifiers for k classes problem. The training time will be k times longer. Sincerely, DB Tsai --- My Blog: https

Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
and there, so we're looking forward to your feedback, and please let us know what you think. We'll continue to improve it and we'll be adding Gradient Boosting in the near future as well. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-10 Thread DB Tsai
Spark caches the RDD in the JVM, so presumably, yes, the singleton trick should work. Sent from my Google Nexus 5 On Aug 9, 2014 11:00 AM, Kevin James Matzen kmat...@cs.cornell.edu wrote: I have a related question. With Hadoop, I would do the same thing for non-serializable objects and setup().
