Re: Offline elastic index creation

2022-11-10 Thread Debasish Das
Hi Vibhor, We worked on a project to create Lucene indexes using Spark, but the project has not been maintained for some time now. If there is interest we can resurrect it

Re: dremel paper example schema

2018-10-29 Thread Debasish Das
The open source impl of Dremel is Parquet! On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta wrote: > Hi, > > why not just use dremel? > > Regards, > Gourav Sengupta > > On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev < > lubomir.chorbadj...@gmail.com> wrote: > >> Hi, >> >> I'm trying to reproduce the

ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi, ECOS is a solver for second-order cone programs (SOCPs) and we showed the Spark integration at the 2014 Spark Summit https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/. Right now the examples show how to reformulate matrix factorization as a SOCP and solve

Re: Restful API Spark Application

2017-05-16 Thread Debasish Das
You can run l On May 15, 2017 3:29 PM, "Nipun Arora" wrote: > Thanks all for your response. I will have a look at them. > > Nipun > > On Sat, May 13, 2017 at 2:38 AM vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> It's in scala but it should be

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7m rows and 700k features (or say 1m features), brute force row similarity will run fine as well...check out SPARK-4823...you can compare quality with the approximate variant... On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote: > Hi everyone, > Since spark 2.1.0

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-05 Thread Debasish Das
y to call predict on single vector. > There is no API exposed. It is WIP but not yet released. > > On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com> > wrote: > >> If we expose an API to access the raw models out of PipelineModel can't >> we call predict direc

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
, graph and kernel models we use a lot and for them turned out that mllib style model predict were useful if we change the underlying store... On Feb 4, 2017 9:37 AM, "Debasish Das" <debasish.da...@gmail.com> wrote: > If we expose an API to access the raw models out of PipelineMo

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
res to score through spark.ml.Model >predict API". The predict API is in the old mllib package not the new ml >package. >- "why r we using dataframe and not the ML model directly from API" - >Because as of now the new ml package does not have the direct API.

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
I am not sure why I will use a pipeline to do scoring...the idea is to build a model, use the model ser/deser feature to put it in the row or column store of choice and provide an API to access the model...we support these primitives in github.com/Verizon/trapezium...the api has access to spark context in

Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Debasish Das
You may want to pull up the release/1.2 branch and the 1.2.0 tag to build it yourself in case the packages are not available. On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim" wrote: > Hi Ayan, > > Thanks a million. > > Regards, > _ > *Md. Rezaul

Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
> gives an idea. Is it possible to make this more efficient? I don't want to >> use probabilistic functions, and I will cache the matrix because many >> distances are looked up at the matrix, computing them on demand would >> require far more computations. >> >>

Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent...on local I never paid attention but the code path should be similar... On Jan 18, 2016 8:00 AM, "Koert Kuipers" wrote: > stacktrace? details? > > On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Debasish Das
to add. You can add an issue in breeze for the enhancement. Alternatively you can use the breeze lpsolver as well, which uses the simplex from Apache Math. On Nov 4, 2015 1:05 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com> wrote: > Hi Debasish Das, > > Firstly I must show my deep appreciat

Re: apply simplex method to fix linear programming in spark

2015-11-03 Thread Debasish Das
> > On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das <debasish.da...@gmail.com> > wrote: > > Use breeze simplex which in turn uses Apache Math's simplex...if you want > to > > use interior point method you can use ecos > > https://github.com/embotech/ecos-java-scala ...

Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Debasish Das
Use breeze simplex which in turn uses Apache Math's simplex...if you want to use an interior point method you can use ECOS https://github.com/embotech/ecos-java-scala ...the Spark Summit 2014 talk on the quadratic solver in matrix factorization will show you an example integration with Spark. ECOS runs as JNI

Re: Running 2 spark application in parallel

2015-10-23 Thread Debasish Das
You can run 2 threads in the driver and Spark will FIFO-schedule the 2 jobs on the same Spark context you created (executors and cores)...the same idea is used for the Spark SQL thriftserver flow... For streaming I think it lets you run only one stream at a time even if you run them on multiple threads on
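
A minimal sketch of the two-driver-threads idea, assuming a standalone Scala app (the RDDs here are hypothetical; FIFO is the default unless the fair scheduler is configured):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.{SparkConf, SparkContext}

    object TwoJobs extends App {
      val sc = new SparkContext(new SparkConf().setAppName("two-jobs"))
      // Two actions submitted from separate driver threads; the scheduler
      // queues them FIFO on the same executors and cores.
      val job1 = Future { sc.parallelize(1 to 1000000).map(_ * 2).count() }
      val job2 = Future { sc.parallelize(1 to 1000000).filter(_ % 3 == 0).count() }
      println(Await.result(job1, Duration.Inf))
      println(Await.result(job2, Duration.Inf))
      sc.stop()
    }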

Re: Spark ANN

2015-09-07 Thread Debasish Das
Not sure about dropout, but if you change the solver from breeze BFGS to breeze OWLQN or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1 regularization, which will yield elastic-net style sparse solutions...using that you can clean up edges which have 0.0 as weight... On Sep 7, 2015 7:35

Re: Package Release Announcement: Spark SQL on HBase Astro

2015-07-28 Thread Debasish Das
, the access path is as follows: Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer -> Astro Optimizer -> HBase Scans/Gets -> … -> HBase Region server Regards, Yan *From:* Debasish Das [mailto:debasish.da...@gmail.com] *Sent:* Monday, July 27, 2015 10:02 PM *To:* Yan Zhou.sc

RE: Package Release Announcement: Spark SQL on HBase Astro

2015-07-27 Thread Debasish Das
Hi Yan, Is it possible to access the HBase table through the Spark SQL JDBC layer? Thanks. Deb On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Yes, but not all SQL-standard insert variants. *From:* Debasish Das [mailto:debasish.da...@gmail.com] *Sent:* Wednesday, July 22

Re: Package Release Announcement: Spark SQL on HBase Astro

2015-07-22 Thread Debasish Das
Does it also support insert operations ? On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote: We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release. http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase The main features in this

Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in SparkR that mllib / ml don't have...most of the basic analysis that you need on a stream can be done through mllib components... On Jul 13, 2015 2:35 PM, Feynman Liang fli...@databricks.com wrote: Sorry; I think I may have used poor wording. SparkR will let you use R to

Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from 1000 users to 1 users ? On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif hafsa.a...@matchinguu.com wrote: I have almost the same case. I will tell you what I am actually doing, if it is according to your requirement,

Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
how far it can be pushed. Thanks for your help! -- Eric On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das debasish.da...@gmail.com wrote: I got good runtime improvement from hive partitioning, caching the dataset and increasing the cores through repartition...I think for your case

Re: Subsecond queries possible?

2015-06-30 Thread Debasish Das
I got good runtime improvement from hive partitioning, caching the dataset and increasing the cores through repartition...I think for your case generating mysql-style indexing will help further...it is not supported in spark sql yet... I know the dataset might be too big for 1-node mysql but do

Re: Velox Model Server

2015-06-24 Thread Debasish Das
Model sizes are 10m x rank, 100k x rank range. For recommendation/topic modeling I can run batch recommendAll and then keep serving the model using a distributed cache but then I can't incorporate per user model re-predict if user feedback is making the current topk stale. I have to wait for next

Re: Velox Model Server

2015-06-24 Thread Debasish Das
and reload factors from S3 periodically. We then use Elasticsearch to post-filter results and blend content-based stuff - which I think might be more efficient than SparkSQL for this particular purpose. On Wed, Jun 24, 2015 at 8:59 AM, Debasish Das debasish.da...@gmail.com wrote: Model sizes

Re: Velox Model Server

2015-06-22 Thread Debasish Das
engine probably doesn't matter at all in comparison. On Sat, Jun 20, 2015, 9:40 PM Debasish Das debasish.da...@gmail.com wrote: After getting used to Scala, writing Java is too much work :-) I am looking for scala based project that's using netty at its core (spray is one example

Velox Model Server

2015-06-20 Thread Debasish Das
Hi, The demo of the end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or uses a completely different architecture. https://github.com/amplab/velox-modelserver We are looking for an open

Re: Velox Model Server

2015-06-20 Thread Debasish Das
Integration of model server with ML pipeline API. On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto don...@prediction.io wrote: Mind if I ask what 1.3/1.4 ML features that you are looking for? On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote: After getting used to Scala

Re: Velox Model Server

2015-06-20 Thread Debasish Das
charles.ce...@gmail.com wrote: Is velox NOT open source? On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote: Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also not sure how threading helps here because Spark puts a partition on each core. On each core maybe there are multiple threads if you are using Intel hyperthreading, but I will let Spark handle the threading. On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com wrote: We

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS dgemm based calculation. On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat ayman.fara...@yahoo.com.invalid wrote: Thanks Sabarish and Nick Would you happen to have some code snippets that you can share. Best Ayman On Jun

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also, in my experiments it's much faster to do blocked BLAS through cartesian rather than doing sc.union. Here are the details on the experiments: https://issues.apache.org/jira/browse/SPARK-4823 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com wrote: Also not sure how

Re: Does MLLib has attribute importance?

2015-06-18 Thread Debasish Das
Running L1 and picking the nonzero coefficients gives a good estimate of interesting features as well... On Jun 17, 2015 4:51 PM, Xiangrui Meng men...@gmail.com wrote: We don't have it in MLlib. The closest would be the ChiSqSelector, which works for categorical data. -Xiangrui On Thu, Jun 11,

Re: Linear Regression with SGD

2015-06-10 Thread Debasish Das
It's always better to use a quasi-Newton solver if the runtime and problem scale permit, as there are guarantees on optimization...OWLQN and BFGS are both quasi-Newton. Most single-node code bases will run quasi-Newton solves...if you are using SGD it is better to use adadelta/adagrad or similar

Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list? In-order traversal (or some other traversal) of a fitted decision tree? On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Is there an existing way in SparkML to convert a decision tree to a decision list? On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Debasish Das
SparkSQL was built to further improve upon the Hive on Spark runtime... On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hive on Spark and SparkSQL which should be better , and what are the key characteristics and the advantages and the disadvantages

Re: Find KNN in Spark SQL

2015-05-19 Thread Debasish Das
The batch version of this is part of the rowSimilarities JIRA SPARK-4823...if your query points can fit in memory there is a broadcast version which we are experimenting with internally...we are using brute-force KNN right now in the PR...based on the FLANN paper, LSH did not work well, but before you go to

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
A cross join's shuffle space might not be needed, since most likely through application-specific logic (topK etc.) you can cut the shuffle space...Also, most likely the brute-force approach will be a benchmark tool to see how much better your clustering-based KNN solution is, since there are several ways you

Re: Benchmaking col vs row similarities

2015-04-10 Thread Debasish Das
, Debasish Das debasish.da...@gmail.com wrote: Hi, I am benchmarking row vs col similarity flow on 60M x 10M matrices... Details are in this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 For testing I am using Netflix data since the structure is very similar: 50k x 17K near dense

RDD union

2015-04-09 Thread Debasish Das
Hi, I have some code that creates ~80 RDDs and then a sc.union is applied to combine all 80 into one for the next step (to run topByKey for example)... While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3 hrs (I am validating these numbers)... Is there any checkpoint
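
One thing worth checking here (a sketch, assuming a SparkContext sc; makePart is a hypothetical stand-in for the per-RDD creation): a single sc.union over the whole Seq keeps the lineage flat, whereas 80 chained .union calls nest the dependency graph and can be very slow to schedule.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def makePart(sc: SparkContext, i: Int): RDD[(Int, Double)] =
      sc.parallelize(Seq((i, i.toDouble)))      // stand-in for the real per-RDD work

    def combineAll(sc: SparkContext): RDD[(Int, Double)] = {
      val parts = (1 to 80).map(i => makePart(sc, i))
      sc.union(parts)   // one union over the Seq, not rdd1.union(rdd2).union(...)
    }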

Re: Using DIMSUM with ids

2015-04-07 Thread Debasish Das
I have a version that works well for Netflix data but now I am validating on internal datasets...this code will work on matrix factors and sparse matrices that have rows >= 100 * columns...if columns are much smaller than rows then the col-based flow works well...basically we need both flows... I did

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
sorted list by using a priority queue and dequeuing top N values. In the end, I get a record for each segment with N max values for each segment. Regards, Aung On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com wrote: In that case you can directly use count-min

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well...get the 10% topK elements from each partition and use a merge from any sort algorithm like timsort...basically aggregateBy. Your version uses shuffle but this version is 0 shuffle...assuming your data set is cached you will be using in-memory allReduce through
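
A sketch of the shuffle-free flavor (assumes the RDD is cached; taking the global top-k from every partition and then merging is exact, since each global winner must be in its own partition's top-k):

    import org.apache.spark.rdd.RDD

    def topFraction(data: RDD[Double], fraction: Double): Array[Double] = {
      val k = math.max(1, (data.count() * fraction).toInt)
      data.mapPartitions { it =>
        // local top-k per partition, no shuffle
        Iterator(it.toArray.sorted(Ordering[Double].reverse).take(k))
      }.reduce { (a, b) =>
        // merge two candidate arrays and keep the k largest
        (a ++ b).sorted(Ordering[Double].reverse).take(k)
      }
    }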

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
for your suggestions. In-memory version is quite useful. I do not quite understand how you can use aggregateBy to get 10% top K elements. Can you please give an example? Thanks, Aung On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com wrote: You can do it in-memory as well

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098 The idea here is what Sean said...don't try to reconstruct the whole matrix, which will be dense, but pick a set of users and calculate topk recommendations for them using dense level-3 BLAS...we are going to merge

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column-based similarities work well if the columns are mild (10K, 100K; we actually scaled it to 1.5M columns but it really stress-tests the shuffle and you need to tune the shuffle parameters)...You can either use dimsum sampling or come up with your own threshold based on your application that
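
For reference, both flows are a single call on RowMatrix; a toy sketch (assumes a SparkContext sc; the 0.1 threshold is an arbitrary example value):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 0.0),
      Vectors.dense(0.0, 1.0, 3.0))))
    val exact  = mat.columnSimilarities()      // brute-force all-pairs cosine
    val approx = mat.columnSimilarities(0.1)   // DIMSUM sampling with a threshold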

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am running brute force similarity from RowMatrix on a job with 5M x 1.5M sparse matrix with 800M entries. With 200M

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
with 1.5m columns, because the output can potentially have 2.25 x 10^12 entries (1.5m squared), which is a lot. Best, Reza On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das debasish.da...@gmail.com wrote: Is the threshold valid only for tall skinny matrices ? Mine is 6 m x 1.5 m and I made

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
that the key would be filtered. And then after, run a flatMap or something to make Option[B] into B. On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like

Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based on some threshold... Is there a way to get the key, value after map+combine stages so that I can run a filter on the keys ? Thanks. Deb

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
partitions and apply your filtering. Then you can finish with a reduceByKey. On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based
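
A sketch of that flow, assuming a sum aggregation and a hypothetical threshold (note the filter here sees per-partition partial sums, not final totals):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    def combineFilterReduce(data: RDD[(String, Long)], threshold: Long): RDD[(String, Long)] = {
      data.mapPartitions { it =>
        // local combine, as the map-side combiner would do
        val sums = mutable.HashMap.empty[String, Long]
        it.foreach { case (k, v) => sums(k) = sums.getOrElse(k, 0L) + v }
        // drop keys whose partial sum is below the cutoff before shuffling
        sums.iterator.filter { case (_, partial) => partial >= threshold }
      }.reduceByKey(_ + _)
    }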

Re: WARN from Similarity Calculation

2015-02-18 Thread Debasish Das
by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager

Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi, I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job runs fine, but with 800M I am getting exceptions like too many files open and no space left on device... Seems like I need more nodes or use dimsum sampling ?

WARN from Similarity Calculation

2015-02-15 Thread Debasish Das
Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms exceeds 45000ms Do I need to increase the default 45 s to larger values for cases

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
... Neither play nor spray is being used in Spark right now...so it brings dependencies and we already know about the akka conflicts...thriftserver on the other hand is already integrated for JDBC access On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das debasish.da...@gmail.com wrote: Also I wanted

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Hi Michael, I want to cache an RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached-like distributed cache using Spark SQL ? If not what do you suggest we should use for such operations... Thanks. Deb On Fri, Jul 18, 2014 at 1:00 PM,

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
-indexedrdd On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Michael, I want to cache a RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached like distributed cache using Spark SQL ? If not what do you

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
PM, Debasish Das debasish.da...@gmail.com wrote: Thanks...this is what I was looking for... It will be great if Ankur can give brief details about it...Basically how does it contrast with memcached for example... On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com wrote

Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Debasish Das
Hi Dib, For our usecase I want my spark job1 to read from hdfs/cache and write to kafka queues. Similarly spark job2 should read from kafka queues and write to kafka queues. Is writing to kafka queues from a spark job supported in your code ? Thanks Deb On Jan 15, 2015 11:21 PM, Akhil Das

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Debasish Das
If you have a tall x skinny matrix of m users and n products, column similarity will give you an n x n matrix (product x product matrix)...this is also called the product correlation matrix...it can be cosine, pearson or other kinds of correlations...Note that if the entry is unobserved (user Joanary did

Re: Learning rate or stepsize automation

2014-12-08 Thread Debasish Das
Hi Bui, Please use BFGS-based solvers...For BFGS you don't have to specify the step size since the line search will find sufficient decrease each time... For regularization you still have to do a grid search...it's not possible to automate that, but on master you will find nice ways to automate grid

Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as post-processing on the product similarity graph...I call it product similarity but for each product you build a node which keeps the distinct users visiting the product, and two product nodes are connected by an edge if the intersection > 0...you are assuming if no one user

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Debasish Das
rdd.top collects it on the master... If you want topk for a key, run map / mapPartitions and use a bounded priority queue, and reduceByKey the queues. I experimented with topk from algebird and a bounded priority queue wrapped over Java's PriorityQueue (the Spark default)...BPQ is faster. A code example is here:
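
A sketch of the per-key top-N flavor with aggregateByKey and a size-capped priority queue (a stand-in for Spark's BoundedPriorityQueue, which is private[spark]):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    def topNByKey(data: RDD[(String, Double)], n: Int): RDD[(String, Array[Double])] = {
      // min-heap: the smallest of the kept values sits at the head for eviction
      def add(q: mutable.PriorityQueue[Double], v: Double) = {
        q.enqueue(v); if (q.size > n) q.dequeue(); q
      }
      data.aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
        add, (q1, q2) => { q2.foreach(v => add(q1, v)); q1 }
      ).mapValues(_.toArray.sorted(Ordering[Double].reverse))
    }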

Re: Using Breeze in the Scala Shell

2014-11-27 Thread Debasish Das
I have used breeze fine with scala shell: scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner so be careful about the performance...groupByKey does a shuffle even for local groups... reduceByKey and aggregateByKey do run a combiner, but if you want a separate function for each key, you can have a key-to-closure map that you can broadcast and use it in
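
A sketch of the broadcast-closure idea: one reduce function per key, looked up inside a combiner-friendly reduceByKey (the keys and functions are hypothetical):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def reduceWithPerKeyFn(sc: SparkContext, data: RDD[(String, Double)]): RDD[(String, Double)] = {
      val fns: Map[String, (Double, Double) => Double] = Map(
        "clicks" -> ((a: Double, b: Double) => a + b),         // sum for this key
        "price"  -> ((a: Double, b: Double) => math.max(a, b)) // max for this key
      )
      val bcFns = sc.broadcast(fns)
      data.map { case (k, v) => (k, (k, v)) }   // embed the key so the reducer can see it
        .reduceByKey { case ((k, a), (_, b)) => (k, bcFns.value(k)(a, b)) }
        .mapValues(_._2)
    }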

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Normally these operations are used for dictionary building and so I am hoping you can cache the dictionary of
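
A sketch of that pattern (the RDD of terms is hypothetical): persist the zipped RDD and materialize it once, so every later lookup sees the same ids.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def buildDictionary(terms: RDD[String]): RDD[(String, Long)] = {
      val dict = terms.distinct().zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK)
      dict.count()   // force evaluation once so the assigned indices stay fixed
      dict
    }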

Re: Spark on YARN

2014-11-18 Thread Debasish Das
I run my Spark on YARN jobs as: HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit --master yarn --jars test-job.jar --executor-cores 4 --num-executors 10 --executor-memory 16g --driver-memory 4g --class TestClass test.jar It uses HADOOP_CONF_DIR to schedule executors and

Re: flatMap followed by mapPartitions

2014-11-14 Thread Debasish Das
only if the output RDD is expected to be partitioned by some key. RDD[X].flatMap(X => RDD[Y]) If it has to shuffle it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das

Kryo serialization in examples.streaming.TwitterAlgebirdCMS/HLL

2014-11-14 Thread Debasish Das
Hi, If I look inside the algebird Monoid implementation it uses java.io.Serializable... But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't see a KryoRegistrator for the CMS and HLL monoids... In these examples will we run with Kryo serialization on CMS and HLL, or will they be java

flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi, I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ? Thanks. Deb

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse

Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi, I just built the master today and I was testing the IR metrics (MAP and prec@k) on Movielens data to establish a baseline... I am getting a weird error which I have not seen before: MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example mllib.MovieLensALS --kryo --lambda 0.065

Re: Spark LIBLINEAR

2014-10-27 Thread Debasish Das
:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote: Debasish Das writes: If the SVM is not already migrated to BFGS, that's the first thing you should try...Basically following LBFGS Logistic Regression come up with LBFGS based linear SVM... About integrating TRON in mllib, David

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
If the SVM is not already migrated to BFGS, that's the first thing you should try...Basically, following LBFGS Logistic Regression, come up with an LBFGS-based linear SVM... About integrating TRON in mllib, David already has a version of TRON in breeze but someone needs to validate it for linear SVM

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
@dbtsai for the condition number what did you use ? Diagonal preconditioning of the inverse of the B matrix ? But then the B matrix keeps on changing...did you condition it after every few iterations ? Will it be possible to put that code in Breeze, since it will be very useful to condition other solvers as

Re: Solving linear equations

2014-10-22 Thread Debasish Das
Hi Martin, This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;] and x is what you want to find...B is 0 in this case...For mllib normally B is the label...basically create a LabeledPoint where the label is always 0... Use mllib's linear regression and solve the following
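
A minimal sketch of that reformulation (assumes a SparkContext sc; the two rows are made up, since the thread elides the full matrix):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // each equation a_i . x = 0 becomes a LabeledPoint with label 0.0
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 3.0, 1.0)),
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0, 2.0))))
    val model = LinearRegressionWithSGD.train(data, 100)   // 100 SGD iterations
    println(model.weights)                                 // candidate solution x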

Re: Oryx + Spark mllib

2014-10-20 Thread Debasish Das
the architecture. It has all the things you are thinking about:) Thanks, Jayant On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming

Fwd: Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming data / Batch data in HDFS and cross validated with mllib APIs but the model serving layer will give API endpoints like Oryx and read the models may be from

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei ! Congratulations to the databricks team and all the community members... On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which

Re: protobuf error running spark on hadoop 2.4

2014-10-08 Thread Debasish Das
I have faced this in the past and I had to add a profile -Phadoop-2.3... mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote: Hi: I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb is to definitely cache the RDD over which you need to do iterative analysis... For the rest of them, only cache if you have a lot of free memory ! On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote: I think you mean that data2 is a function of data1 in the

Impala comparisons

2014-10-04 Thread Debasish Das
Hi, We write the output of models and other information as parquet files and later we let data APIs run SQL queries on the columnar data... SparkSQL is used to dump the data in parquet format and now we are considering whether using SparkSQL or Impala to read it back... I came across this

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of an object, which can be generic ? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com wrote: Just realized that, of course, objects can't be generic, but how do I create a
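
A fleshed-out sketch of that suggestion against the old 1.x accumulator API (the Seq-concatenation semantics are an assumption):

    import org.apache.spark.{AccumulatorParam, SparkContext}

    // generic AccumulatorParam over Seq[B]: zero is the empty Seq, merge is ++
    class SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {
      def zero(initial: Seq[B]): Seq[B] = Seq.empty[B]
      def addInPlace(a: Seq[B], b: Seq[B]): Seq[B] = a ++ b
    }

    def collectTagged(sc: SparkContext): Seq[String] = {
      val acc = sc.accumulator(Seq.empty[String])(new SeqAccumulatorParam[String])
      sc.parallelize(1 to 10).foreach(i => if (i % 3 == 0) acc += Seq(s"saw $i"))
      acc.value
    }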

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote: We don't handle missing value imputation in the current version of MLlib. In future releases, we can store feature information in the

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Debasish Das
If the tree is too big, build it on graphx...but it will need thorough analysis so that the partitions are well balanced... On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote: Hi Boromir, Assuming the tree fits in memory, and what you want to do is parallelize the

Re: memory vs data_size

2014-09-30 Thread Debasish Das
Only fit the data in memory where you want to run the iterative algorithm... For map-reduce operations, it's better not to cache if you have a memory crunch... Also schedule the persist and unpersist such that you utilize the RAM well... On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei

Re:

2014-09-24 Thread Debasish Das
HBase regionserver needs to be balanced...you might have some skewness in row keys and one regionserver is under pressure...try finding that key and replicate it using a random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to

Re: task getting stuck

2014-09-24 Thread Debasish Das
On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com wrote: HBase regionserver needs to be balanced...you might have some skewness in row keys and one regionserver is under pressure...try finding that key and replicate it using a random salt On Wed, Sep 24, 2014 at 8:51 AM

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
affected, but may not happen to have exhibited in your test. On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com wrote: Some more debug revealed that as Sean said I have to keep the dictionaries persisted till I am done with the RDD manipulation. Thanks Sean

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key "almonds" at value 5187 using: rdd.filter{case(product, index) => product == "almonds"}.collect Output: Debug product almonds index 5187 Now I take the same dictionary and write it out as:

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
- zipWithIndex is being used to assign IDs. From a recent JIRA discussion I understand this is not deterministic within a partition so the index can be different when the RDD is reevaluated. If you need it fixed, persist the zipped RDD on disk or in memory. On Sep 20, 2014 8:10 PM, Debasish Das

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
clear from docs... On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com wrote: I did not persist / cache it as I assumed zipWithIndex will preserve order... There is also zipWithUniqueId...I am trying that...If that also shows the same issue, we should make it clear

Re: MLLib: LIBSVM issue

2014-09-18 Thread Debasish Das
We dump fairly big libsvm files to compare against liblinear/libsvm...the following code dumps out libsvm format from a SparseVector... def toLibSvm(features: SparseVector): String = { val indices = features.indices.map(_ + 1) val values = features.values indices.zip(values).mkString(
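
A hedged completion of the truncated snippet (the "i:v" pair formatting and the space separator are assumptions; a real libsvm line would also prepend the label):

    import org.apache.spark.mllib.linalg.SparseVector

    def toLibSvm(features: SparseVector): String = {
      val indices = features.indices.map(_ + 1)   // libsvm indices are 1-based
      val values  = features.values
      indices.zip(values).map { case (i, v) => s"$i:$v" }.mkString(" ")
    }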

Re: Huge matrix

2014-09-18 Thread Debasish Das
, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product. Best, Reza On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, In similarColumns, it seems with cosine similarity

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ? For logistic you might want both positive and negative features...so just pass it through a filter on abs and then pick top(k) On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I am able to
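
A sketch of that filter-on-abs variant (weights, blocks and k are hypothetical):

    import org.apache.spark.SparkContext

    def topWeights(sc: SparkContext, weights: Array[Double], blocks: Int, k: Int): Array[(Int, Double)] =
      sc.parallelize(weights.zipWithIndex.map(_.swap), blocks)
        .top(k)(Ordering.by((p: (Int, Double)) => math.abs(p._2)))  // rank by |weight|, keep sign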

Re: Huge matrix

2014-09-18 Thread Debasish Das
the PR. We can add jaccard and other similarity measures in later PRs. In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product. Best, Reza On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das

Re: Huge matrix

2014-09-18 Thread Debasish Das
. The PR will updated today. Best, Reza On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, Have you tested if different runs of the algorithm produce different similarities (basically if the algorithm is deterministic) ? This number does not look like

Re: Huge matrix

2014-09-17 Thread Debasish Das
, Debasish Das debasish.da...@gmail.com wrote: Cool...can I add loadRowMatrix in your PR ? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb, Did you mean to message me instead of Xiangrui? For TS matrices, dimsum with positiveinfinity
