Re: Actors and sparkcontext actions

2014-03-04 Thread Debasish Das
Hi Ognen, Any particular reason for choosing Scalatra over options like Play or Spray ? Is Scalatra much better at serving APIs or is it due to its similarity with Ruby's Sinatra ? Did you try the other options and then pick Scalatra ? Thanks. Deb On Tue, Mar 4, 2014 at 4:50 AM, Ognen Duzlevski

Spark join for skewed dataset

2014-03-15 Thread Debasish Das
Hi, If the join keys are skewed, is there a specific optimized join available in Spark for such usecases ? I saw that a similar feature is supported in both Scalding and Hive, and I am testing skewJoinWithSmaller on one of the skewed datasets...

Maximum memory limits

2014-03-16 Thread Debasish Das
Hi, I gave my spark job 16 gb of memory and it is running on 8 executors. The job needs more memory due to ALS requirements (20M x 1M matrix). On each node I do have 96 gb of memory and I am using 16 gb out of it. I want to increase the memory but I am not sure what is the right way to do

How to kill a spark app ?

2014-03-16 Thread Debasish Das
Are these the right options: 1. If there is a spark script, just do a ctrl-c from spark-shell and the job will be killed properly. 2. For a spark application, ctrl-c will also kill the job properly on the cluster. Somehow the ctrl-c option did not work for us... Similar option works fine for

Re: How to kill a spark app ?

2014-03-16 Thread Debasish Das
Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sun, Mar 16, 2014 at 2:59 PM, Debasish Das debasish.da...@gmail.comwrote: Are these the right options: 1. If there is a spark script, just do a ctrl-c from spark-shell and the job

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Debasish Das
I am getting these weird errors which I have not seen before: [error] Server access Error: handshake alert: unrecognized_name url= https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit [info] Resolving

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-24 Thread Debasish Das
After a long time (maybe a month) I could do a fresh build for 2.0.0-mr1-cdh4.5.0...I was using the cached files in .ivy2/cache. My case is especially painful since I have to build behind a firewall... @Sean thanks for the fix...I think we should put a test for https/firewall compilation as

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Debasish Das
Niko, Comparing some other components will be very useful as well...svd++ from graphx vs the same algorithm in graphlab...also mllib.recommendation.als implicit/explicit compared to the collaborative filtering toolkit in graphlab... To stress test, what's the biggest sparse dataset that you

ALS memory limits

2014-03-26 Thread Debasish Das
, not native code. This should be exactly what that PR I mentioned fixes. -- Sean Owen | Director, Data Science | London On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das debasish.da...@gmail.com wrote: Thanks Sean...let me get the latest code..do you know which PR

Spark preferred compression format

2014-03-26 Thread Debasish Das
Hi, What's the splittable compression format that works with Spark right now ? We are looking into bzip2 / lzo / gzip...gzip is not splittable so not a good option...Within bzip2/lzo I am confused. Thanks. Deb

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Debasish Das
Classes are serialized and sent to all the workers as akka msgs...for singletons and case classes I am not sure if they are java-serialized or kryo-serialized by default. But definitely your own classes, if serialized by kryo, will be much more efficient...there is a comparison that Matei did for all
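
A minimal sketch of the registration pattern being described, against the Spark 1.x Kryo API; the ProductScore class and registrator name are hypothetical:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// A custom class shipped to workers inside closures / RDD records.
case class ProductScore(product: String, score: Double)

// Registering the class lets Kryo write a small integer id instead of the
// full class name with every serialized record.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[ProductScore])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator") // fully-qualified name in a real app
```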

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Debasish Das
I think multiply-by-ratings is a heuristic that worked on rating-related problems like the netflix dataset or other ratings datasets, but the scope of NMF is much broader than that. @Sean please correct me in case you don't agree... Definitely it's good to add all the rating dataset related

Re: Optimal Server Design for Spark

2014-04-02 Thread Debasish Das
Hi Matei, How can I run multiple Spark workers per node ? I am running an 8 core 10 node cluster but I do have 8 more cores on each node...So having 2 workers per node will definitely help my usecase. Thanks. Deb On Wed, Apr 2, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.comwrote: Hey

Re: Optimal Server Design for Spark

2014-04-03 Thread Debasish Das
://www.sigmoidanalytics.com @mayur_rustagi On Wed, Apr 2, 2014 at 7:19 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Matei, How can I run multiple Spark workers per node ? I am running 8 core 10 node cluster but I do have 8 more cores on each nodeSo having 2 workers per node

Re: Heartbeat exceeds

2014-04-05 Thread Debasish Das
: spark.storage.blockManagerSlaveTimeoutMs to a higher value. In your case it's setting this to 45000 or 45 seconds. On Fri, Apr 4, 2014 at 5:52 PM, Debasish Das debasish.da...@gmail.comwrote: Hi, In my ALS runs I am noticing messages that complain about heart beats: 14/04/04 20:43:09 WARN

Re: Heartbeat exceeds

2014-04-05 Thread Debasish Das
From the documentation this is what I understood: 1. spark.worker.timeout: Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats. default: 60 I increased it to be 600 It was pointed before that if there is GC overload and the worker

Re: Heartbeat exceeds

2014-04-05 Thread Debasish Das
to GC...Persisting the factors to disk at each iteration will resolve this issue with runtime loss of course... I also have another issue...I run with executor memory as 24g but I see 18.4 GB in executor ui...is that expected ? On Sat, Apr 5, 2014 at 8:16 AM, Debasish Das debasish.da

Re: Spark - ready for prime time?

2014-04-10 Thread Debasish Das
When you say Spark is one of the forerunners for our technology choice, what are the other options you are looking into ? I started cross validation runs on a 40 core, 160 GB spark job using a script...I woke up in the morning and none of the jobs had crashed ! and the project just came out of incubation

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
Mllib has a decision tree...there is a RF PR which is not active now...take that and swap the tree builder with the fast tree builder that's in mllib...search for the spark jira...the code is based on google's PLANET paper... I am sure people in devlist are already working on it...send an email to

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
). Additionally, the lack of multi-class classification limits its applicability. Also, RF requires random features per tree node to be effective (not just bootstrap samples), and MLLib decision tree doesn't support that. On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das debasish.da...@gmail.com

Re: Random Forest on Spark

2014-04-18 Thread Debasish Das
support that. On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das debasish.da...@gmail.com wrote: Mllib has decision treethere is a rf pr which is not active nowtake that and swap the tree builder with the fast tree builder that's in mllib...search for the spark jira...the code is based

Re: running SparkALS

2014-04-28 Thread Debasish Das
, if I did want to run that example, where would I find the file in question? It would be great if this were documented, perhaps in the source code. I'll add a JIRA. Thanks, Diana On Mon, Apr 28, 2014 at 1:41 PM, Debasish Das debasish.da...@gmail.comwrote: Diana, Here are the parameters

Re: Schema view of HadoopRDD

2014-05-10 Thread Debasish Das
Hi, For each line that we read as textLine from HDFS, we have a schema...if there is an API that takes the schema as List[Symbol] and maps each token to the Symbol it will be helpful... One solution is to keep data on hdfs as avro/protobuf serialized objects but not sure if that works on HBase

Re: Spark LIBLINEAR

2014-05-11 Thread Debasish Das
Hello Prof. Lin, Awesome news ! I am curious if you have any benchmarks comparing C++ MPI with Scala Spark liblinear implementations... Is Spark Liblinear apache licensed or are there any specific restrictions on using it ? Except using native blas libraries (which each user has to manage by

Turn BLAS on MacOSX

2014-05-13 Thread Debasish Das
Hi, How do I load native BLAS libraries on Mac ? I am getting the following errors while running LR and SVM with SGD: 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
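
A common fix in this era was to add netlib-java's native bundle to the application itself; a sketch of the sbt side, assuming the app (not Spark) pulls in the "all" artifact (building Spark with the -Pnetlib-lgpl profile is the other documented route):

```scala
// build.sbt (sketch): the "all" artifact bundles JNI natives for common
// platforms, including OS X, so NativeSystemBLAS / NativeRefBLAS can load.
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
```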

Re: Spark LIBLINEAR

2014-05-14 Thread Debasish Das
is compatible with Apache. Best, Xiangrui On Sun, May 11, 2014 at 10:29 AM, Debasish Das debasish.da...@gmail.com wrote: Hello Prof. Lin, Awesome news ! I am curious if you have any benchmarks comparing C++ MPI with Scala Spark liblinear implementations... Is Spark Liblinear apache licensed

Schema view of HadoopRDD

2014-05-15 Thread Debasish Das
Hi, For each line that we read as textLine from HDFS, we have a schema...if there is an API that takes the schema as List[Symbol] and maps each token to the Symbol it will be helpful... Do RDDs provide a schema view of the dataset on HDFS ? Thanks. Deb
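
In the absence of such an API, a minimal hand-rolled sketch of the idea in spark-shell, assuming comma-delimited lines and hypothetical field names and path:

```scala
// Pair each token of a delimited line with its schema Symbol.
val schema = List('user, 'product, 'rating)

val records = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  schema.zip(line.split(",")).toMap // Map[Symbol, String] per line
}

// Schema-aware access by name rather than by position.
records.filter(r => r('product) == "almonds").count()
```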

Re: How to run the SVM and LogisticRegression

2014-05-16 Thread Debasish Das
There are examples to run them in BinaryClassification.scala in org.apache.spark.examples... On Wed, May 14, 2014 at 1:36 PM, yxzhao yxz...@ualr.edu wrote: Hello, I found the classfication algorithms SVM and LogisticRegression implemented in the following directory. And how to run them?

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Debasish Das
Xiangrui, Could you point to the JIRA related to tree aggregate ? ...sounds like the allreduce idea... I would definitely like to try it on our dataset... Makoto, I did run a pretty big sparse dataset (20M rows, 3M sparse features) and I got 100 iterations of SGD running in 200 seconds...10

Re: Shark vs Impala

2014-06-22 Thread Debasish Das
600s for Spark vs 5s for Redshift...The numbers look much different from the amplab benchmark... https://amplab.cs.berkeley.edu/benchmark/ Is it like SSDs or something that's helping redshift or the whole data is in memory when you run the query ? Could you publish the query ? Also after

Databricks demo

2014-07-11 Thread Debasish Das
Hi, The Databricks demo at spark summit was amazing...what's the frontend stack used, specifically for rendering multiple reactive charts on the same DOM? Looks like that's an emerging pattern for correlating different data APIs... Thanks Deb

Re: spark1.0.1 hadoop2.2.0 issue

2014-07-19 Thread Debasish Das
I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today... No issues with mvn compilation but my sbt build keeps failing on the sql module... I just saw that my scala is at 2.11.0 (with brew update)...not sure if that's why the sbt compilation is failing...retrying.. On Sat, Jul 19, 2014 at 6:16 PM,

Re: spark1.0.1 hadoop2.2.0 issue

2014-07-20 Thread Debasish Das
Yup...the scala version 2.11.0 caused it...with 2.10.4, I could compile 1.0.1 and HEAD both for 2.3.0cdh5.0.2 On Sat, Jul 19, 2014 at 8:14 PM, Debasish Das debasish.da...@gmail.com wrote: I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today... No issues with mvn compilation but my sbt build

Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
Hi, We have been using standalone spark for last 6 months and I used to run application jars fine on spark cluster with the following command. java -cp :/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar -Xms2g -Xmx2g

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
I found the issue... If you use spark git and generate the assembly jar then org.apache.hadoop.io.Writable.class is packaged with it If you use the assembly jar that ships with CDH in

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-28 Thread Debasish Das
Hi Aureliano, Will it be possible for you to give the test-case ? You can add it to JIRA as well as an attachment I guess... I am preparing the PR for ADMM based QuadraticMinimizer...In my matlab experiments with scaling the rank to 1000 and beyond (which is too high for ALS but gives a good

Re: Contribution to Spark MLLib

2014-08-13 Thread Debasish Das
Dennis, If it is PLSA with least square loss then the QuadraticMinimizer that we open sourced should be able to solve it for modest topics (till 1000 I believe)...if we integrate a cg solver for equality (Nocedal's KNITRO paper is the reference) the topic size can be increased much larger than

SPARK_LOCAL_DIRS option

2014-08-13 Thread Debasish Das
Hi, I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can use more shuffle space... Does Spark clean all the shuffle files once the runs are done ? Seems to me that the shuffle files are not cleaned... Do I need to set this variable ? spark.cleaner.ttl Right now we are

Re: SPARK_LOCAL_DIRS

2014-08-14 Thread Debasish Das
Actually I faced it yesterday... I had to put it in spark-env.sh and take it out from spark-defaults.conf on 1.0.1...Note that this setting should be visible on all workers... After that I validated that SPARK_LOCAL_DIRS was indeed getting used for shuffling... On Thu, Aug 14, 2014 at 10:27
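
For completeness, a sketch of the programmatic equivalent: the spark.local.dir property names the same scratch directories, though SPARK_LOCAL_DIRS set in spark-env.sh on each worker takes precedence over it (paths here are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  // Comma-separated list of scratch directories used for shuffle spills.
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
val sc = new SparkContext(conf)
```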

Performance hit for using sc.setCheckPointDir

2014-08-14 Thread Debasish Das
Hi, For our large ALS runs, we are considering using sc.setCheckpointDir so that the intermediate factors are written to HDFS and the lineage is broken... Is there a comparison which shows the performance degradation due to these options ? If not I will be happy to add experiments with it...
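
A minimal spark-shell sketch of the mechanics being asked about, with a stand-in update in place of the real ALS iteration and a hypothetical HDFS path:

```scala
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var factors: RDD[(Int, Array[Double])] =
  sc.parallelize(0 until 1000).map(i => (i, Array.fill(20)(0.1)))

for (iter <- 1 to 20) {
  factors = factors.mapValues(_.map(_ * 0.99)) // stand-in for one ALS sweep
  if (iter % 5 == 0) {
    factors.checkpoint() // mark for checkpointing to HDFS
    factors.count()      // force an action so the file is written and lineage cut
  }
}
```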

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Debasish Das
DB, Did you compare softmax regression with one-vs-all and found that softmax is better ? one-vs-all can be implemented as a wrapper over binary classifier that we have in mllib...I am curious if softmax multinomial is better on most cases or is it worthwhile to add a one vs all version of mlor

ALS checkpoint performance

2014-08-15 Thread Debasish Das
Hi, Are there any experiments detailing the performance hit due to HDFS checkpoint in ALS ? As we scale to large ranks with more ratings, I believe we have to cut the RDD lineage to safeguard against the lineage issue... Thanks. Deb

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
Hi Brandon, Looks very cool...will try it out for ad-hoc analysis of our datasets and provide more feedback... Could you please give bit more details about the differences of Spindle architecture compared to Hue + Spark integration (python stack) and Ooyala Jobserver ? Does Spindle allow

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Debasish Das
Hi Wei, Sparkler code was not available for benchmarking and so I picked up Jellyfish which uses SGD and if you look at the paper, the ideas are very similar to sparkler paper but Jellyfish is on shared memory and uses C code while sparkler was built on top of spark...Jellyfish used some

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Debasish Das
Breeze author David also has a github project on cuda binding in scala...do you prefer using java or scala ? On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com wrote: you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close

Re: Huge matrix

2014-09-05 Thread Debasish Das
Hi Reza, Have you compared with the brute force algorithm for similarity computation with something like the following in Spark ? https://github.com/echen/scaldingale I am adding cosine similarity computation but I do want to compute all-pair similarities... Note that the data is sparse for

Re: Huge matrix

2014-09-05 Thread Debasish Das
/apache/spark/pull/1778 Your question wasn't entirely clear - does this answer it? Best, Reza On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, Have you compared with the brute force algorithm for similarity computation with something like the following

Re: Huge matrix

2014-09-05 Thread Debasish Das
that you don't have to redo your code. Your call if you need it before a week. Reza On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com wrote: Ohh coolall-pairs brute force is also part of this PR ? Let me pull it in and test on our dataset... Thanks. Deb On Fri, Sep 5

Re: Huge matrix

2014-09-05 Thread Debasish Das
Also for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then run all pair similarity... Did you also try similar ideas and saw positive results ? On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das debasish.da

Re: Huge matrix

2014-09-05 Thread Debasish Das
) if your goal is to find batches of similar points instead of all pairs above a threshold. On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das debasish.da...@gmail.com wrote: Also for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension

Re: Huge matrix

2014-09-05 Thread Debasish Das
, calling dimsum with gamma as PositiveInfinity turns it into the usual brute force algorithm for cosine similarity, there is no sampling. This is by design. On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com wrote: I looked at the code: similarColumns(Double.posInf

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Durin, I have integrated ecos with spark which uses suitesparse under the hood for linear equation solves...I have exposed only the qp solver api in spark since I was comparing ip with proximal algorithms but we can expose suitesparse api as well...jni is used to load up ldl amd and ecos

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
version of ldl and amd which are lgpl... Let me know. Thanks. Deb On Sep 8, 2014 7:04 AM, Debasish Das debasish.da...@gmail.com wrote: Durin, I have integrated ecos with spark which uses suitesparse under the hood for linear equation solvesI have exposed only the qp solver api in spark

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
in a distributed way. -Xiangrui On Mon, Sep 8, 2014 at 7:12 AM, Debasish Das debasish.da...@gmail.com wrote: Xiangrui, Should I open up a JIRA for this ? Distributed lp/socp solver through ecos/ldl/amd ? I can open source it with gpl license in spark code as that's what our legal

Re: Huge matrix

2014-09-09 Thread Debasish Das
similarity in a future PR, probably still for 1.2 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com wrote: Awesome...Let me try it out... Any plans of putting other similarity measures in future (jaccard is something that will be useful) ? I guess it makes sense to add some

Re: Huge matrix

2014-09-09 Thread Debasish Das
one. For dense matrices with say, 1m columns this won't be computationally feasible and you'll want to start sampling with dimsum. It would be helpful to have a loadRowMatrix function, I would use it. Best, Reza On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das debasish.da...@gmail.com wrote

Re: Huge matrix

2014-09-17 Thread Debasish Das
, Debasish Das debasish.da...@gmail.com wrote: Cool...can I add loadRowMatrix in your PR ? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb, Did you mean to message me instead of Xiangrui? For TS matrices, dimsum with positiveinfinity

Re: MLLib: LIBSVM issue

2014-09-18 Thread Debasish Das
We dump fairly big libsvm files to compare against liblinear/libsvm...the following code dumps out libsvm format from a SparseVector... def toLibSvm(features: SparseVector): String = { val indices = features.indices.map(_ + 1) val values = features.values indices.zip(values).mkString(
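
One plausible completion of the helper quoted above (the archive truncates it), keeping LIBSVM's 1-based indices; the label would be prepended separately:

```scala
import org.apache.spark.mllib.linalg.SparseVector

def toLibSvm(features: SparseVector): String = {
  val indices = features.indices.map(_ + 1) // LIBSVM indices are 1-based
  val values  = features.values
  indices.zip(values).map { case (i, v) => s"$i:$v" }.mkString(" ")
}
```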

Re: Huge matrix

2014-09-18 Thread Debasish Das
, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product. Best, Reza On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, In similarColumns, it seems with cosine similarity

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ? For logistic you might want both positive and negative features...so just pass it through a filter on abs and then pick top(k) On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I am able to
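
A self-contained sketch of that abs-then-top-k idea, with hypothetical weights standing in for model.weights.toArray:

```scala
// weights stands in for model.weights.toArray from a trained linear model
val weights: Array[Double] = Array(0.3, -2.1, 0.05, 1.7, -0.4)
val k = 3

val topFeatures = weights.zipWithIndex
  .sortBy { case (w, _) => -math.abs(w) } // rank by |weight| so large negatives surface
  .take(k)                                // Array of (weight, featureIndex)
```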

Re: Huge matrix

2014-09-18 Thread Debasish Das
the PR. We can add jaccard and other similarity measures in later PRs. In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product. Best, Reza On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das

Re: Huge matrix

2014-09-18 Thread Debasish Das
. The PR will updated today. Best, Reza On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, Have you tested if different runs of the algorithm produce different similarities (basically if the algorithm is deterministic) ? This number does not look like

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key almonds at value 5187 using: rdd.filter { case (product, index) => product == "almonds" }.collect Output: Debug product almonds index 5187 Now I take the same dictionary and write it out as:

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
- zipWithIndex is being used to assign IDs. From a recent JIRA discussion I understand this is not deterministic within a partition so the index can be different when the RDD is reevaluated. If you need it fixed, persist the zipped RDD on disk or in memory. On Sep 20, 2014 8:10 PM, Debasish Das
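
A minimal sketch of that fix: persist the zipped RDD so the index assignment is computed once and stays stable across reuse:

```scala
val products = sc.parallelize(Seq("almonds", "bread", "coffee", "almonds"))

// cache() pins one index assignment; without it, re-evaluation may reassign ids.
val dictionary = products.distinct().zipWithIndex().cache()

val almondsId = dictionary
  .filter { case (product, _) => product == "almonds" }
  .map { case (_, index) => index }
  .first()
```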

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
clear from docs... On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com wrote: I did not persist / cache it as I assumed zipWithIndex will preserve order... There is also zipWithUniqueId...I am trying that...If that also shows the same issue, we should make it clear

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
affected, but may not happen to have exhibited in your test. On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com wrote: Some more debug revealed that as Sean said I have to keep the dictionaries persisted till I am done with the RDD manipulation. Thanks Sean

Re:

2014-09-24 Thread Debasish Das
HBase regionserver needs to be balanced...you might have some skewness in row keys and one regionserver is under pressure...try finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to

Re: task getting stuck

2014-09-24 Thread Debasish Das
On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com wrote: HBase regionserver needs to be balancedyou might have some skewness in row keys and one regionserver is under pressuretry finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Debasish Das
If the tree is too big build it on graphx...but it will need thorough analysis so that the partitions are well balanced... On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote: Hi Boromir, Assuming the tree fits in memory, and what you want to do is parallelize the

Re: memory vs data_size

2014-09-30 Thread Debasish Das
Only fit the data in memory where you want to run the iterative algorithm... For map-reduce operations, it's better not to cache if you have a memory crunch... Also schedule the persist and unpersist such that you utilize the RAM well... On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of an object, which can be generic ? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com wrote: Just realized that, of course, objects can't be generic, but how do I create a
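
A sketch of that class-based approach against the Spark 1.x accumulator API; Seq concatenation is just an illustrative merge:

```scala
import org.apache.spark.AccumulatorParam

class SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {
  def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
  def addInPlace(s1: Seq[B], s2: Seq[B]): Seq[B] = s1 ++ s2
}

val acc = sc.accumulator(Seq.empty[Int])(new SeqAccumulatorParam[Int])
sc.parallelize(1 to 5).foreach(x => acc += Seq(x))
// acc.value now holds the collected elements (order not guaranteed)
```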

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote: We don't handle missing value imputation in the current version of MLlib. In future releases, we can store feature information in the

Impala comparisons

2014-10-04 Thread Debasish Das
Hi, We write the output of models and other information as parquet files and later we let data APIs run SQL queries on the columnar data... SparkSQL is used to dump the data in parquet format and now we are considering whether using SparkSQL or Impala to read it back... I came across this

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb is to definitely cache the RDD over which you need to do iterative analysis... For the rest of them, only cache if you have a lot of free memory ! On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote: I think you mean that data2 is a function of data1 in the

Re: protobuf error running spark on hadoop 2.4

2014-10-08 Thread Debasish Das
I have faced this in the past and I had to put a profile -Phadoop-2.3... mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote: Hi: I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei ! Congratulations to the databricks team and all the community members... On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which

Fwd: Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming data / Batch data in HDFS and cross validated with mllib APIs but the model serving layer will give API endpoints like Oryx and read the models may be from

Re: Oryx + Spark mllib

2014-10-20 Thread Debasish Das
the architecture. It has all the things you are thinking about:) Thanks, Jayant On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming

Re: Solving linear equations

2014-10-22 Thread Debasish Das
Hi Martin, This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;] and x is what you want to find...B is 0 in this case...For mllib normally this is the label...basically create a labeledPoint where the label is 0 always... Use mllib's linear regression and solve the following
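
A sketch of that formulation against the 1.x MLlib API, with hypothetical coefficient rows; each row of A becomes the features and the matching entry of B (here 0) is the label:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val rows = Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 3.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0, 2.0))
)
val data = sc.parallelize(rows).cache()

val model = LinearRegressionWithSGD.train(data, 100) // 100 SGD iterations
// model.weights approximates x
```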

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
If the SVM is not already migrated to BFGS, that's the first thing you should try...Basically, following LBFGS Logistic Regression, come up with an LBFGS-based linear SVM... About integrating TRON in mllib, David already has a version of TRON in breeze but someone needs to validate it for linear SVM

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
@dbtsai for the condition number what did you use ? Diagonal preconditioning of the inverse of the B matrix ? But then the B matrix keeps on changing...did you condition it after every few iterations ? Will it be possible to put that code in Breeze since it will be very useful to condition other solvers as

Re: Spark LIBLINEAR

2014-10-27 Thread Debasish Das
:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote: Debasish Das writes: If the SVM is not already migrated to BFGS, that's the first thing you should try...Basically following LBFGS Logistic Regression come up with LBFGS based linear SVM... About integrating TRON in mllib, David

Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi, I just built the master today and I was testing the IR metrics (MAP and prec@k) on Movielens data to establish a baseline... I am getting a weird error which I have not seen before: MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example mllib.MovieLensALS --kryo --lambda 0.065

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse
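
A sketch of the guard being discussed: check the factor RDDs before calling lookup(...).head so a missing user or product yields NaN instead of an exception:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def safePredict(model: MatrixFactorizationModel, user: Int, product: Int): Double = {
  val u = model.userFeatures.lookup(user)       // Seq of factor arrays (empty if unseen)
  val p = model.productFeatures.lookup(product)
  if (u.isEmpty || p.isEmpty) Double.NaN        // user/product absent from training
  else u.head.zip(p.head).map { case (a, b) => a * b }.sum
}
```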

flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi, I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ? Thanks. Deb

Re: flatMap followed by mapPartitions

2014-11-14 Thread Debasish Das
only if output RDD is expected to be partitioned by some key. RDD[X].flatmap(X=RDD[Y]) If it has to shuffle it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das

Kryo serialization in examples.streaming.TwitterAlgebirdCMS/HLL

2014-11-14 Thread Debasish Das
Hi, If I look inside the algebird Monoid implementation it uses java.io.Serializable... But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't see a KryoRegistrator for the CMS and HLL monoids... In these examples will we run with Kryo serialization on CMS and HLL or will they be java

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner so be careful about the performance...groupByKey does shuffle even for local groups... reduceByKey and aggregateByKey do run a combiner, but if you want a separate function for each key, you can have a key to closure map that you can broadcast and use it in
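
A sketch of that broadcast pattern; since the reduce function never sees the key, the key is carried inside the value and used to choose the closure (keys and functions here are hypothetical):

```scala
val mergeFns: Map[String, (Int, Int) => Int] = Map(
  "sum" -> ((a, b) => a + b),
  "max" -> ((a, b) => math.max(a, b))
)
val fns = sc.broadcast(mergeFns)

val data = sc.parallelize(Seq(("sum", 1), ("sum", 2), ("max", 3), ("max", 5)))
val result = data
  .map { case (k, v) => (k, (k, v)) } // carry the key inside the value
  .reduceByKey { case ((k, a), (_, b)) => (k, fns.value(k)(a, b)) }
  .mapValues { case (_, v) => v }     // ("sum", 3), ("max", 5)
```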

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Normally these operations are used for dictionary building and so I am hoping you can cache the dictionary of

Re: Spark on YARN

2014-11-18 Thread Debasish Das
I run my Spark on YARN jobs as: HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit --master yarn --jars test-job.jar --executor-cores 4 --num-executors 10 --executor-memory 16g --driver-memory 4g --class TestClass test.jar It uses HADOOP_CONF_DIR to schedule executors and

Re: Using Breeze in the Scala Shell

2014-11-27 Thread Debasish Das
I have used breeze fine with scala shell: scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Debasish Das
rdd.top collects it on the master... If you want topk for a key, run map / mapPartitions and use a bounded priority queue and reduceByKey the queues. I experimented with topk from algebird and a bounded priority queue wrapped over java's PriorityQueue (spark's default)...bpq is faster. Code example is here:
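
A sketch of the shape of that aggregation; for brevity a small sorted list stands in for the bounded priority queue mentioned above, but the map-side combining behavior is the same:

```scala
val k = 2
// Keep only the k largest values seen so far for each key.
def add(top: List[Int], x: Int): List[Int] =
  (x :: top).sorted(Ordering[Int].reverse).take(k)

val data = sc.parallelize(Seq(("a", 5), ("a", 1), ("a", 9), ("b", 7), ("b", 2)))
val topk = data.aggregateByKey(List.empty[Int])(
  add,
  (t1, t2) => (t1 ++ t2).sorted(Ordering[Int].reverse).take(k)
) // combining happens map-side before the shuffle, unlike groupByKey
```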

Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as post-processing on a product similarity graph...I call it product similarity but for each product you build a node which keeps distinct users visiting the product, and two product nodes are connected by an edge if the intersection > 0...you are assuming if no one user

Re: Learning rate or stepsize automation

2014-12-08 Thread Debasish Das
Hi Bui, Please use BFGS based solvers...For BFGS you don't have to specify step size since the line search will find sufficient decrease each time... For regularization you still have to do grid search...it's not possible to automate that, but on master you will find nice ways to automate grid

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Debasish Das
If you have a tall x skinny matrix of m users and n products, column similarity will give you an n x n matrix (product x product matrix)...this is also called the product correlation matrix...it can be cosine, pearson or other kinds of correlations...Note that if the entry is unobserved (user Joanary did
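
A sketch against the RowMatrix API (MLlib 1.2), where rows are users and columns are products; the tiny dense matrix is illustrative:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0),
  Vectors.dense(5.0, 6.0, 0.0)
))
val mat = new RowMatrix(rows)

val exact  = mat.columnSimilarities()    // brute force all-pairs cosine
val approx = mat.columnSimilarities(0.1) // DIMSUM sampling, threshold 0.1
// both return an upper-triangular n x n CoordinateMatrix of similarities
```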

Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Debasish Das
Hi Dib, For our usecase I want my spark job1 to read from hdfs/cache and write to kafka queues. Similarly spark job2 should read from kafka queues and write to kafka queues. Is writing to kafka queues from a spark job supported in your code ? Thanks Deb On Jan 15, 2015 11:21 PM, Akhil Das

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
... Neither play nor spray is being used in Spark right now...so it brings dependencies and we already know about the akka conflicts...thriftserver on the other hand is already integrated for JDBC access On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das debasish.da...@gmail.com wrote: Also I wanted

Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi, I am running brute force similarity from RowMatrix on a job with a 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job runs fine but with 800M I am getting exceptions like too many open files and no space left on device... Seems like I need more nodes or use dimsum sampling ?

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
that the key would be filtered. And then after, run a flatMap or something to make Option[B] into B. On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like
