Re: Welcome two new Apache Spark committers

2023-08-06 Thread Debasish Das
Congratulations Peter and Xiduo. On Sun, Aug 6, 2023, 7:05 PM Wenchen Fan wrote: > Hi all, > > The Spark PMC recently voted to add two new committers. Please join me in > welcoming them to their new role! > > - Peter Toth (Spark SQL) > - Xiduo You (Spark SQL) > > They consistently make

Re: Welcome Xinrong Meng as a Spark committer

2022-08-10 Thread Debasish Das
Congratulations Xinrong ! On Tue, Aug 9, 2022, 10:00 PM Rui Wang wrote: > Congrats Xinrong! > > > -Rui > > On Tue, Aug 9, 2022 at 8:57 PM Xingbo Jiang wrote: > >> Congratulations! >> >> Yuanjian Li wrote on Tue, Aug 9, 2022 at 20:31: >> >>> Congratulations, Xinrong! >>> >>> XiDuo You wrote on Tue, Aug 9, 2022 at 19:18:

Re: SIGMOD System Award for Apache Spark

2022-05-15 Thread Debasish Das
Congratulations to the whole spark community ! It's a great achievement. On Sat, May 14, 2022, 2:49 AM Yikun Jiang wrote: > Awesome! Congrats to the whole community! > > On Fri, May 13, 2022 at 3:44 AM Matei Zaharia > wrote: > >> Hi all, >> >> We recently found out that Apache Spark received

ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi, ECOS is a solver for second order conic programs and we showed the Spark integration at 2014 Spark Summit https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/. Right now the examples show how to reformulate matrix factorization as a SOCP and solve

Re: Hinge Gradient

2017-12-17 Thread Debasish Das
If you can point me to previous benchmarks that are done, I would like to use smoothing and see if the LBFGS convergence improved while not impacting linear svc loss. Thanks. Deb On Dec 16, 2017 7:48 PM, "Debasish Das" <debasish.da...@gmail.com> wrote: Hi Weichen, Traditionall

Re: Hinge Gradient

2017-12-16 Thread Debasish Das
hould be considered > carefully. > Is there any literature that proves changing max to soft-max can behave > well? > I’m more than happy to see some benchmarks if you can have. > > + Yuhao, who did similar effort in this PR: https://github.com/apache/spark/pull/17862 > > Rega

Hinge Gradient

2017-12-13 Thread Debasish Das
Hi, I looked into the LinearSVC flow and found the gradient for hinge as follows: Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x))). Therefore the gradient is -(2y - 1)*x. max is a non-smooth function. Did we try using ReLu/Softmax function and use that to smooth the hinge
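
For readers following the thread, the loss and gradient quoted above can be sketched in a few lines. This is a minimal illustration, not Spark's LinearSVC code: it assumes a linear model f_w(x) = w.x, and the softplus surrogate with its `beta` parameter is only one possible smoothing, chosen here as an assumption.

```python
import math

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - (2y - 1) * w.x) for a label y in {0, 1}."""
    margin = (2 * y - 1) * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def hinge_subgradient(w, x, y):
    """Subgradient w.r.t. w: -(2y - 1) * x while the margin is below 1, else 0."""
    s = 2 * y - 1
    margin = s * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        return [-s * xi for xi in x]
    return [0.0] * len(x)

def softplus_hinge_loss(w, x, y, beta=10.0):
    """Smooth surrogate softplus(beta * (1 - margin)) / beta (hypothetical choice)."""
    margin = (2 * y - 1) * sum(wi * xi for wi, xi in zip(w, x))
    z = beta * (1.0 - margin)
    # Numerically stable log(1 + exp(z)).
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta
```

The surrogate is differentiable everywhere and stays within log(2)/beta of the hinge, so a larger `beta` tracks the hinge more closely at the cost of a sharper bend near margin 1.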

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Debasish Das
+1 Is there any design doc related to API/internal changes ? Will CP be the default in structured streaming, or is it a mode in conjunction with existing behavior ? Thanks. Deb On Nov 1, 2017 8:37 AM, "Reynold Xin" wrote: Earlier I sent out a discussion thread for CP in

Re: Spark Improvement Proposals

2016-10-16 Thread Debasish Das
Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now as we went deeper with Spark and real-time streaming use-case gets more prominent, I

Re: Using spark MLlib without installing Spark

2015-11-26 Thread Debasish Das
Decoupling mllib and core is difficult...it is not intended to run spark core 1.5 with spark mllib 1.6 snapshot...core is more stable while new algorithms keep getting added to mllib, so sometimes you might be tempted to do that, but it's not recommended. On Nov 21, 2015 8:04 PM, "Reynold Xin"

Re: RDD API patterns

2015-09-17 Thread Debasish Das
Rdd nesting can lead to recursive nesting...i would like to know the usecase and why join can't support it...you can always expose an api over a rdd and access that in another rdd mappartition...use a external data source like hbase cassandra redis to support the api... For ur case group by and

Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-28 Thread Debasish Das
, the access path is as follows: Spark SQL JDBC Interface - Spark SQL Parser/Analyzer/Optimizer-Astro Optimizer- HBase Scans/Gets - … - HBase Region server Regards, Yan *From:* Debasish Das [mailto:debasish.da...@gmail.com] *Sent:* Monday, July 27, 2015 10:02 PM *To:* Yan Zhou.sc

RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-27 Thread Debasish Das
Hi Yan, Is it possible to access the hbase table through spark sql jdbc layer ? Thanks. Deb On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Yes, but not all SQL-standard insert variants . *From:* Debasish Das [mailto:debasish.da...@gmail.com] *Sent:* Wednesday, July 22

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
AM, Debasish Das debasish.da...@gmail.com wrote: Yeah, I think the idea of confidence is a bit different than what I am looking for using implicit factorization to do document clustering. I basically need (r_ij - w_ih_j)^2 for all observed ratings and (0 - w_ih_j)^2 for all the unobserved

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
I will think further but in the current implicit formulation with confidence, looks like I am factorizing a 0/1 matrix with weights 1 + alpha*rating for observed (1) values and 1 for unobserved (0) values. It's a bit different from LSA model. On Sun, Jul 26, 2015 at 6:45 AM, Debasish Das
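
The weighting described in this message can be written out as a small objective function. This is a sketch under stated assumptions rather than the MLlib implementation: the function name, the dict-based rating layout, and the dense loop over all pairs are hypothetical and only practical for toy sizes; ratings are assumed non-negative.

```python
def implicit_weighted_loss(ratings, user_factors, item_factors, alpha=1.0):
    """Sum over all (i, j) of c_ij * (p_ij - w_i . h_j)^2.

    ratings maps (user, item) -> observed rating; missing pairs are unobserved.
    p_ij is 1 for observed pairs and 0 otherwise.
    c_ij is 1 + alpha * rating for observed pairs and 1 otherwise.
    """
    total = 0.0
    for i, w in enumerate(user_factors):
        for j, h in enumerate(item_factors):
            pred = sum(a * b for a, b in zip(w, h))
            if (i, j) in ratings:
                conf, pref = 1.0 + alpha * ratings[(i, j)], 1.0
            else:
                conf, pref = 1.0, 0.0
            total += conf * (pref - pred) ** 2
    return total
```

Setting `alpha` to 0 gives every cell weight 1, which is the plain unweighted factorization of the 0/1 matrix.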

Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
heavily skewed to pay attention to the high-count instances. On Sun, Jul 26, 2015 at 9:19 AM, Debasish Das debasish.da...@gmail.com wrote: Yeah, I think the idea of confidence is a bit different than what I am looking for using implicit factorization to do document clustering. I

Confidence in implicit factorization

2015-07-25 Thread Debasish Das
Hi, Implicit factorization is important for us since it drives recommendation when modeling user click/no-click and also topic modeling to handle 0 counts in document x word matrices through NMF and Sparse Coding. I am a bit confused on this code: val c1 = alpha * math.abs(rating) if (rating

Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-22 Thread Debasish Das
Does it also support insert operations ? On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote: We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release. http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase The main features in this

Gossip protocol in Master selection

2015-06-28 Thread Debasish Das
Hi, Akka cluster uses gossip protocol for Master election. The approach in Spark right now is to use Zookeeper for high availability. Interestingly Cassandra and Redis clusters are both using Gossip protocol. I am not sure what is the default behavior right now. If the master dies and zookeeper

Spark SQL 1.3 Exception

2015-06-24 Thread Debasish Das
, 2015 at 12:21 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I have some impala created parquet tables which hive 0.13.2 can read fine. Now the same table when I want to read using Spark SQL 1.3 I am getting exception class exception that parquet.hive.serde.ParquetHiveSerde not found

Impala created parquet tables

2015-06-20 Thread Debasish Das
Hi, I have some impala created parquet tables which hive 0.13.2 can read fine. Now the same table when I want to read using Spark SQL 1.3 I am getting exception class exception that parquet.hive.serde.ParquetHiveSerde not found. I am assuming that hive somewhere is putting the

Velox Model Server

2015-06-20 Thread Debasish Das
Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or it uses a completely different architecture. https://github.com/amplab/velox-modelserver We are looking for an open

Re: Welcoming some new committers

2015-06-20 Thread Debasish Das
Congratulations to All. DB great work in bringing quasi newton methods to Spark ! On Wed, Jun 17, 2015 at 3:18 PM, Chester Chen ches...@alpinenow.com wrote: Congratulations to All. DB and Sandy, great works ! On Wed, Jun 17, 2015 at 3:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Streaming data + Blocked Model

2015-05-28 Thread Debasish Das
Hi, We want to keep the model created and loaded in memory through Spark batch context since blocked matrix operations are required to optimize on runtime. The data is streamed in through Kafka / raw sockets and Spark Streaming Context. We want to run some prediction operations with the

Re: spark packages

2015-05-24 Thread Debasish Das
Wendell pwend...@gmail.com wrote: Yes - spark packages can include non ASF licenses. On Sat, May 23, 2015 at 6:16 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Is it possible to add GPL/LGPL code on spark packages or it must be licensed under Apache as well ? I want to expose

Re: Kryo option changed

2015-05-24 Thread Debasish Das
Yu yuzhih...@gmail.com wrote: Pardon me. Please use '8192k' Cheers On Sat, May 23, 2015 at 6:24 PM, Debasish Das debasish.da...@gmail.com wrote: Tried 8mb...still I am failing on the same error... On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote: bq. it shuld be 8mb

Kryo option changed

2015-05-23 Thread Debasish Das
Hi, I am on last week's master but all the examples that set up the following .set("spark.kryoserializer.buffer", "8m") are failing with the following error: Exception in thread "main" java.lang.IllegalArgumentException: spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb. looks

spark packages

2015-05-23 Thread Debasish Das
Hi, Is it possible to add GPL/LGPL code on spark packages or it must be licensed under Apache as well ? I want to expose Professor Tim Davis's LGPL library for sparse algebra and ECOS GPL library through the package. Thanks. Deb

Re: Kryo option changed

2015-05-23 Thread Debasish Das
Tried 8mb...still I am failing on the same error... On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote: bq. it shuld be 8mb Please use the above syntax. Cheers On Sat, May 23, 2015 at 6:04 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am on last week's master

Power iteration clustering

2015-05-23 Thread Debasish Das
Hi, What was the motivation to write power iteration clustering using graphx and not a vector matrix multiplication over similarity matrix represented as say coordinate matrix ? We can use gemv in that flow to block the computation. Over graphx can we do all k eigen vector computation together
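
The matrix-vector formulation raised in this question can be illustrated with the generic power method. This is a plain-Python sketch, not the GraphX-based implementation in Spark: `matvec` stands in for a gemv call, and the dense list-of-lists matrix, the fixed step count, and the L1 normalization are assumptions made for the example.

```python
def matvec(matrix, v):
    """Dense matrix-vector product, the role gemv would play in a blocked flow."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def power_iteration(matrix, steps=100):
    """Repeatedly apply the matrix and renormalize by the L1 norm."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(steps):
        w = matvec(matrix, v)
        norm = sum(abs(x) for x in w)
        v = [x / norm for x in w]
    return v
```

With a well-separated leading eigenvalue the vector converges to the dominant eigenvector direction.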

IndexedRowMatrix semantics

2015-05-20 Thread Debasish Das
Hi, For indexedrowmatrix and rowmatrix, both take RDD(vector)...is it possible that it has intermixed dense and sparse vector...basically I am considering a gemv flow when indexedrowmatrix has dense flag true, dot flow otherwise... Thanks. Deb

Re: How can I do pair-wise computation between RDD feature columns?

2015-05-16 Thread Debasish Das
I opened it up today but it should help you: https://github.com/apache/spark/pull/6213 On Sat, May 16, 2015 at 6:18 PM, Chunnan Yao yaochun...@gmail.com wrote: Hi all, Recently I've run into a scenario to conduct two sample tests between all paired combination of columns of an RDD. But the

Re: mllib.recommendation Design

2015-03-30 Thread Debasish Das
as I see the result. I am not sure if it is supported by public packages like graphlab or scikit but the plsa papers show interesting results. On Mar 30, 2015 2:31 PM, Xiangrui Meng men...@gmail.com wrote: On Wed, Mar 25, 2015 at 7:59 AM, Debasish Das debasish.da...@gmail.com wrote: Hi

Re: mllib.recommendation Design

2015-03-25 Thread Debasish Das
is that ALM will support MAP (and may be KL divergence loss) with sparsity constraints (probability simplex and bounds are fine for what I am focused at right now)... Thanks. Deb On Tue, Feb 17, 2015 at 4:40 PM, Debasish Das debasish.da...@gmail.com wrote: There is a usability difference...I am not sure

LogisticGradient Design

2015-03-25 Thread Debasish Das
Hi, Right now LogisticGradient implements both binary and multi-class in the same class using an if-else statement which is a bit convoluted. For Generalized matrix factorization, if the data has distinct ratings I want to use LeastSquareGradient (regression has given best results to date) but

Re: LogisticGradient Design

2015-03-25 Thread Debasish Das
multiclass logistic loss/gradient. If it's not a big hit, then it might be simpler from an outside API perspective to keep them in 1 class (even if it's more complicated within). Joseph On Wed, Mar 25, 2015 at 8:15 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Right now

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-21 Thread Debasish Das
to track this here: SPARK-6442 https://issues.apache.org/jira/browse/SPARK-6442 The design doc is here: http://goo.gl/sf5LCE We would very much appreciate your feedback and input. Best, Burak On Thu, Mar 19, 2015 at 3:06 PM, Debasish Das debasish.da...@gmail.com wrote: Yeah

Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-18 Thread Debasish Das
Hi David, We are stress testing breeze.optimize.proximal and nnls...if you are cutting a release now, we will need another release soon once we get the runtime optimizations in place and merged to breeze. Thanks. Deb On Mar 15, 2015 9:39 PM, David Hall david.lw.h...@gmail.com wrote: snapshot

Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread Debasish Das
Any reason why the regularization path cannot be implemented using current owlqn pr ? We can change owlqn in breeze to fit your needs... On Feb 24, 2015 3:27 PM, Joseph Bradley jos...@databricks.com wrote: Hi Mike, I'm not aware of a standard big dataset, but there are a number available:

If job fails shuffle space is not cleaned

2015-02-18 Thread Debasish Das
Hi, Some of my jobs failed due to no space left on device and on those jobs I was monitoring the shuffle space...when the job failed shuffle space did not clean and I had to manually clean it... Is there a JIRA already tracking this issue ? If no one has been assigned to it, I can take a look.

Re: Batch prediciton for ALS

2015-02-17 Thread Debasish Das
another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction API in ALS will be useful for us who want

Re: mllib.recommendation Design

2015-02-17 Thread Debasish Das
. For a general matrix factorization package, let's make a JIRA and move our discussion there. -Xiangrui On Fri, Feb 13, 2015 at 7:46 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am bit confused on the mllib design in the master. I thought that core algorithms will stay

Batch prediciton for ALS

2015-02-10 Thread Debasish Das
Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction API in ALS will be useful for us who want to cross validate on prec@k and MAP... Thanks. Deb
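
For context, the two ranking measures named here can be computed as below. This is a hedged sketch, not the API from the pull request: the function names are hypothetical, and the average-precision normalizer (the smaller of the number of relevant items and the list length) is one common convention among several.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision(recommended, relevant):
    """Sum of precision at each rank holding a relevant item, normalized."""
    if not relevant or not recommended:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), len(recommended))

def mean_average_precision(per_user):
    """per_user is a list of (recommended list, relevant set) pairs."""
    return sum(average_precision(r, s) for r, s in per_user) / len(per_user)
```

Because each hit contributes precision at its own rank, getting the first recommendations right counts for more than the last ones.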

Re: Welcoming three new committers

2015-02-03 Thread Debasish Das
Congratulations ! Keep helping the community :-) On Tue, Feb 3, 2015 at 5:34 PM, Denny Lee denny.g@gmail.com wrote: Awesome stuff - congratulations! :) On Tue Feb 03 2015 at 5:34:06 PM Chao Chen crazy...@gmail.com wrote: Congratulations guys, well done! On 15-2-4 at 9:26 AM, Nan Zhu

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
For CDH this works well for me...tested till 5.1... ./make-distribution -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -Phive -DskipTests To build with hive thriftserver support for spark-sql On Fri, Dec 12, 2014 at 1:41 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – we’re

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
protobuf comes from missing -Phadoop-2.3 On Fri, Dec 12, 2014 at 2:34 PM, Sean Owen so...@cloudera.com wrote: What errors do you see? protobuf errors usually mean you didn't build for the right version of Hadoop, but if you are using -Phadoop-2.3 or better -Phadoop-2.4 that should be fine.

Row Similarity

2014-12-10 Thread Debasish Das
Hi, It seems there are multiple places where we would like to compute row similarity (accurate or approximate similarities) Basically through RowMatrix columnSimilarities we can compute column similarities of a tall skinny matrix Similarly we should have an API in RowMatrix called

Re: Row Similarity

2014-12-10 Thread Debasish Das
of a matrix A (i.e. computing AA^T, which is expensive). There is a JIRA to track handling (1) and (2) more efficiently than computing all pairs: https://issues.apache.org/jira/browse/SPARK-3066 On Wed, Dec 10, 2014 at 2:44 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, It seems

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread Debasish Das
with Jellyfish code http://i.stanford.edu/hazy/victor/Hogwild/), will reproduce the failure... https://issues.apache.org/jira/browse/SPARK-4231 The failed job I will debug more and figure out the real cause. If needed I will open up new JIRAs. On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Debasish Das
-1 from me...same FetchFailed issue as what Hector saw... I am running Netflix dataset and dumping out recommendation for all users. It shuffles around 100 GB data on disk to run a reduceByKey per user on utils.BoundedPriorityQueue...The code runs fine with MovieLens1m dataset... I gave Spark 10

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
and appears in test, we can simply ignore it. -Xiangrui On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das debasish.da...@gmail.com wrote: Sean, I thought sampleByKey (stratified sampling) in 1.1 was designed to solve the problem that randomSplit can't sample by key... Xiangrui, What's

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Debasish Das
Andrew, I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap followed by groupBy...My cluster memory is less than the memory I need and therefore flatMap does around 400 GB of shuffle...memory is around 120 GB... 14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in

Using sampleByKey

2014-11-17 Thread Debasish Das
Hi, I have a rdd whose key is a userId and value is (movieId, rating)... I want to sample 80% of the (movieId,rating) that each userId has seen for train, rest is for test... val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))} val keyedRatings = indexedRating.map{x =
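
The per-user 80/20 split described here can be sketched without Spark as follows. This is only an illustration of the intent, assuming the ratings fit in a local dict keyed by userId; the function name and parameters are hypothetical, and it takes an exact fraction per key, which is not necessarily how `sampleByKey` behaves.

```python
import random

def split_by_key(keyed_ratings, train_fraction=0.8, seed=42):
    """Split each user's (movieId, rating) list into train and test parts."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, ratings in keyed_ratings.items():
        shuffled = list(ratings)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train[user] = shuffled[:cut]
        test[user] = shuffled[cut:]
    return train, test
```

Every rating lands in exactly one of the two parts, so each user who appears in test also has history in train whenever they have more than a couple of ratings.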

TimSort in 1.2

2014-11-13 Thread Debasish Das
Hi, I am noticing the first step for Spark jobs does a TimSort in 1.2 branch...and there is some time spent doing the TimSort...Is this assigning the RDD blocks to different nodes based on a sort order ? Could someone please point to a JIRA about this change so that I can read more about it ?

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-10 Thread Debasish Das
/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote
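
The block-wise top-k idea mentioned in this reply can be sketched for a single user as below. It is a local, single-machine illustration with hypothetical names and a dict of item factors per block, not the distributed implementation discussed in the JIRA.

```python
import heapq

def top_k_items(user_factor, item_factors, k):
    """Return the k (score, item_id) pairs with the highest dot-product score."""
    scored = ((sum(u * v for u, v in zip(user_factor, f)), item_id)
              for item_id, f in item_factors.items())
    return heapq.nlargest(k, scored)

def top_k_blockwise(user_factor, item_blocks, k):
    """Keep only k candidates per block, then merge them into a global top-k."""
    candidates = []
    for block in item_blocks:
        candidates.extend(top_k_items(user_factor, block, k))
    return heapq.nlargest(k, candidates)
```

Only k candidates per block survive the first pass, so the merge step never sees the full cross product.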

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Debasish Das
+1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 Sean On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse

Re: matrix factorization cross validation

2014-11-03 Thread Debasish Das
:24 PM, Sean Owen so...@cloudera.com wrote: MAP is effectively an average over all k from 1 to min(# recommendations, # items rated) Getting first recommendations right is more important than the last. On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das debasish.da...@gmail.com wrote

MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
wonder if it is possible to extend the DIMSUM idea to computing top K matrix multiply between the user and item factor matrices, as opposed to all-pairs similarity of one matrix? On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das debasish.da...@gmail.com wrote: Is there an example of how to use

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
any of the topic modeling algorithms as well... Is there a better place for it other than mllib examples ? On Thu, Oct 30, 2014 at 8:13 AM, Debasish Das debasish.da...@gmail.com wrote: I thought topK will save us...for each user we have 1xrank...now our movie factor is a RDD...we pick topK movie

matrix factorization cross validation

2014-10-29 Thread Debasish Das
Hi, In the current factorization flow, we cross validate on the test dataset using the RMSE number but there are some other measures which are worth looking into. If we consider the problem as a regression problem and the ratings 1-5 are considered as 5 classes, it is possible to generate a

Re: matrix factorization cross validation

2014-10-29 Thread Debasish Das
, Debasish Das debasish.da...@gmail.com wrote: Hi, In the current factorization flow, we cross validate on the test dataset using the RMSE number but there are some other measures which are worth looking into. If we consider the problem as a regression problem and the ratings 1-5

Re: matrix factorization cross validation

2014-10-29 Thread Debasish Das
to examples.MovielensALS. ROC should be good to add as well. -Xiangrui On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, In the current factorization flow, we cross validate on the test dataset using the RMSE number but there are some other measures which are worth

Re: Oryx + Spark mllib

2014-10-19 Thread Debasish Das
wrote: Oryx 2 seems to be geared for Spark https://github.com/OryxProject/oryx 2014-10-18 11:46 GMT-04:00 Debasish Das debasish.da...@gmail.com: Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either

NNLS bug

2014-10-17 Thread Debasish Das
Hi, I am validating the proximal algorithm for positive and bound constrained ALS and I came across the bug detailed in the JIRA while running ALS with NNLS: https://issues.apache.org/jira/browse/SPARK-3987 ADMM based proximal algorithm came up with correct result... Thanks. Deb

Re: Issues with ALS positive definite

2014-10-16 Thread Debasish Das
in a different implementation and it has worked fine. Now I have to go hunt for how the QR decomposition is exposed in BLAS... Looks like its GEQRF which JBLAS helpfully exposes. Debasish you could try it for fun at least. On Oct 15, 2014 8:06 PM, Debasish Das debasish.da...@gmail.com wrote: But do

Re: Issues with ALS positive definite

2014-10-16 Thread Debasish Das
Just checked, QR is exposed by netlib: import org.netlib.lapack.Dgeqrf For the equality and bound version, I will use QR...it will be faster than the LU that I am using through jblas.solveSymmetric... On Thu, Oct 16, 2014 at 8:34 AM, Debasish Das debasish.da...@gmail.com wrote: @xiangrui

Issues with ALS positive definite

2014-10-15 Thread Debasish Das
Hi, If I take the Movielens data and run the default ALS with regularization as 0.0, I am hitting exception from LAPACK that the gram matrix is not positive definite. This is on the master branch. This is how I run it : ./bin/spark-submit --total-executor-cores 1 --master spark://

Re: Issues with ALS positive definite

2014-10-15 Thread Debasish Das
, 2014 at 5:01 PM, Liquan Pei liquan...@gmail.com wrote: Hi Debasish, I think ||r - wi'hj||^{2} is semi-positive definite. Thanks, Liquan On Wed, Oct 15, 2014 at 4:57 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, If I take the Movielens data and run the default ALS with regularization
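
The point about semi-definiteness can be reproduced on a toy case: with zero regularization a Gram matrix built from fewer ratings than the rank is singular, so a Cholesky factorization fails, while adding lambda to the diagonal fixes it. The pure-Python Cholesky below is a small stand-in for the LAPACK routine, and the helper names and the pivot tolerance are assumptions for this sketch.

```python
import math

def gram(rows):
    """Return A^T A for a list of row vectors."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]

def cholesky(m):
    """Lower-triangular factor; raises ValueError if m is not positive definite."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def regularized(m, lam):
    """Return m + lam * I."""
    return [[v + (lam if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(m)]
```

For example, a single factor row `[1.0, 2.0]` gives the Gram matrix `[[1, 2], [2, 4]]`, which `cholesky` rejects, while `regularized(..., 0.1)` factorizes without error.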

Local tests logging to log4j

2014-10-07 Thread Debasish Das
Hi, I have added some changes to ALS tests and I am re-running tests as: mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DwildcardSuites=org.apache.spark.mllib.recommendation.ALSSuite test I have some INFO logs in the code which I want to see on my console. They work fine if I add

Re: Local tests logging to log4j

2014-10-07 Thread Debasish Das
=ERROR log4j.logger.org.apache.zookeeper=WARN log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.I0Itec.zkclient=WARN On Tue, Oct 7, 2014 at 7:42 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I have added some changes to ALS tests and I am re-running tests as: mvn

Cluster tests failing

2014-09-30 Thread Debasish Das
Hi, Inside mllib I am running tests using: mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn install The local tests run fine but cluster tests are failing.. LBFGSClusterSuite: - task size should be small *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage

Re: Cluster tests failing

2014-09-30 Thread Debasish Das
I have done mvn clean several times... All the mllib tests that use LocalClusterSparkContext.scala consistently fail !

Re: Hyper Parameter Optimization Algorithms

2014-09-29 Thread Debasish Das
You should look into Evan Sparks's talk from Spark Summit 2014 http://spark-summit.org/2014/talk/model-search-at-scale I am not sure if some of it is already open sourced through MLBase... On Mon, Sep 29, 2014 at 7:45 PM, Lochana Menikarachchi locha...@gmail.com wrote: Hi, Is there anyone

Re: I want to contribute MLlib two quality measures(ARHR and HR) for top N recommendation system. Is this meaningful?

2014-09-19 Thread Debasish Das
Hi Xiangrui, Could you please point to some reference for calculating prec@k and ndcg@k ? prec is precision I suppose but ndcg I have no idea about... Thanks. Deb On Mon, Aug 25, 2014 at 12:28 PM, Xiangrui Meng men...@gmail.com wrote: The evaluation metrics are definitely useful. How do

Re: I want to contribute MLlib two quality measures(ARHR and HR) for top N recommendation system. Is this meaningful?

2014-09-19 Thread Debasish Das
Thanks Christoph. Are these numbers for mllib als implicit and explicit feedback on movielens/netflix datasets documented on JIRA ? On Sep 19, 2014 1:16 PM, Christoph Sawade christoph.saw...@googlemail.com wrote: Hey Deb, NDCG is the Normalized Discounted Cumulative Gain [1]. Another
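
Since NDCG comes up in this exchange, here is a minimal sketch of the usual definition, with the discount 1 / log2(rank + 1) and normalization by the ideal ordering. The function names and the input layout (a list of graded gains in recommended order) are assumptions, not an MLlib API.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: sum of gain / log2(rank + 1) over ranks 1..k."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1, and pushing a relevant item further down the list lowers the score through the logarithmic discount.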

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
know that the container got killed by YARN because it used much more memory than it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, During the 4th ALS iteration, I am

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
. -Sandy On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Sandy, Any resolution for YARN failures ? It's a blocker for running spark on top of YARN. Thanks. Deb On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, I think this may

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
executors (unless ALS is using a bunch of off-heap memory?). You mentioned earlier in this thread that the property wasn't showing up in the Environment tab. Are you sure it's making it in? -Sandy On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das debasish.da...@gmail.com wrote: Hmm...I did try

Re: Lost executor on YARN ALS iterations

2014-08-21 Thread Debasish Das
configuration, yarn.nodemanager.vmem-check-enabled is set to false. -Sandy On Wed, Aug 20, 2014 at 12:27 AM, Debasish Das debasish.da...@gmail.com wrote: I could reproduce the issue in both 1.0 and 1.1 using YARN...so this is definitely a YARN related problem... At least for me right now only

Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Debasish Das
be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We know that the container got killed by YARN because it used much more memory than it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das

Akka usage in Spark

2014-08-20 Thread Debasish Das
Hi, There have been some recent changes in the way akka is used in spark and I feel they are major changes... Is there a design document / JIRA / experiment on large datasets that highlight the impact of changes (1.0 vs 1.1) ? Basically it will be great to understand where akka is used in the

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-19 Thread Debasish Das
? @dbtsai did your assembly on YARN run fine, or are you still noticing these exceptions ? Thanks. Deb On Thu, Aug 14, 2014 at 5:48 PM, Reynold Xin r...@databricks.com wrote: Here: https://github.com/apache/spark/pull/1948 On Thu, Aug 14, 2014 at 5:45 PM, Debasish Das debasish.da

Lost executor on YARN ALS iterations

2014-08-19 Thread Debasish Das
Hi, During the 4th ALS iteration, I am noticing that one of the executors gets disconnected: 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it

Spark on YARN webui

2014-08-18 Thread Debasish Das
Hi, We are running the snapshots (new spark features) on YARN and I was wondering if the webui is available on YARN mode... The deployment document does not mention webui on YARN mode... Is it available ? Thanks. Deb

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-15 Thread Debasish Das
5:48 PM, Reynold Xin r...@databricks.com wrote: Here: https://github.com/apache/spark/pull/1948 On Thu, Aug 14, 2014 at 5:45 PM, Debasish Das debasish.da...@gmail.com wrote: Is there a fix that I can test ? I have the flows setup for both standalone and YARN runs... Thanks. Deb

Kryo serialization issues

2014-08-14 Thread Debasish Das
Hi, Is there a JIRA for this bug ? I have seen it multiple times during our ALS runs now...some runs don't show while some runs fail due to the error msg https://github.com/GrahamDennis/spark-kryo-serialisation/blob/master/README.md One way to circumvent this is to not use kryo but then I am

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-12 Thread Debasish Das
I figured out the issue...the driver memory was at 512 MB and for our datasets, the following code needed more memory... // Materialize usersOut and productsOut. usersOut.count() productsOut.count() Thanks. Deb On Sat, Aug 9, 2014 at 6:12 PM, Debasish Das debasish.da...@gmail.com wrote

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
with Java 1.7_55 but the cluster JRE is at 1.7_45. Thanks. Deb On Wed, Aug 6, 2014 at 12:01 PM, Debasish Das debasish.da...@gmail.com wrote: I did not play with Hadoop settings...everything is compiled with 2.3.0CDH5.0.2 for me... I did try to bump the version number of HBase from 0.94 to 0.96

Re: [SNAPSHOT] Snapshot1 of Spark 1.1.0 has been posted

2014-08-08 Thread Debasish Das
Hi Patrick, I am testing the 1.1 branch but I see lot of protobuf warnings while building the jars: [warn] Class com.google.protobuf.Parser not found - continuing with a stub. [warn] Class com.google.protobuf.Parser not found - continuing with a stub. [warn] Class com.google.protobuf.Parser

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) On Tue, Aug 5, 2014 at 5:59 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Xiangrui, I used your idea and kept a cherry picked version

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
they differ in the final recommendation? It would be great if you can test prec@k or ndcg@k metrics. Best, Xiangrui On Wed, Aug 6, 2014 at 8:28 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Xiangrui, Maintaining another file will be a pain later so I deployed spark 1.0.1 without

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
...@dbtsai.com wrote: One related question, is mllib jar independent from hadoop version (doesnt use hadoop api directly)? Can I use mllib jar compile for one version of hadoop and use it in another version of hadoop? Sent from my Google Nexus 5 On Aug 6, 2014 8:29 AM, Debasish Das debasish.da

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-05 Thread Debasish Das
, there might be bugs in it... Any suggestions will be appreciated Thanks. Deb On Sat, Aug 2, 2014 at 11:12 AM, Xiangrui Meng men...@gmail.com wrote: Yes, that should work. spark-mllib-1.1.0 should be compatible with spark-core-1.0.1. On Sat, Aug 2, 2014 at 10:54 AM, Debasish Das debasish.da

Re: Master compilation with sbt

2014-07-20 Thread Debasish Das
On Sat, Jul 19, 2014 at 12:50 PM, Mark Hamstra m...@clearstorydata.com wrote: project mllib . . . clean . . . compile . . . test ...all works fine for me @2a732110d46712c535b75dd4f5a73761b6463aa8 On Sat, Jul 19, 2014 at 11:10 AM, Debasish Das

Master compilation with sbt

2014-07-19 Thread Debasish Das
Hi, Is sbt still used for master compilation ? I could compile for 2.3.0-cdh5.0.2 using maven following the instructions from the website: http://spark.apache.org/docs/latest/building-with-maven.html But when I try to use sbt for local testing, I am getting some weird errors...Is

OWLQN

2014-07-18 Thread Debasish Das
Hi, I thought OWLQN is already merged to mllib optimization but I don't see it in the master yet... Are there any issues in merging it in ? I see there are some merge conflicts right now... https://github.com/apache/spark/pull/840/ Thanks. Deb

Re: PLSA

2014-07-04 Thread Debasish Das
Thanks for the pointer... Looks like you are using EM algorithm for factorization which looks similar to multiplicative update rules Do you think using mllib ALS implicit feedback, you can scale the problem further ? We can handle L1, L2, equality and positivity constraints in ALS now...As long
