Re: How to speed up MLlib LDA?

2015-09-22 Thread Charles Earl
It seems that the Vowpal Wabbit version is most similar to what is in https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala Although the Intel version seems to implement the Hierarchical Dirichlet Process (topics and subtopics) as

Re: Spark MLib v/s SparkR

2015-08-05 Thread Charles Earl
What machine learning algorithms are you interested in exploring or using? Start from there or better yet the problem you are trying to solve, and then the selection may be evident. On Wednesday, August 5, 2015, praveen S mylogi...@gmail.com wrote: I was wondering when one should go for MLib

Re: Velox Model Server

2015-06-20 Thread Charles Earl
Is velox NOT open source? On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote: Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or it uses a

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would tachyon be appropriate here? On Friday, June 5, 2015, Evo Eftimov evo.efti...@isecc.com wrote: Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark Batch Jobs (besides anyone can put something like that in 5 min), while I am under the impression that Dmytiy is

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would the IndexedRDD feature provide what the Lookup RDD does? I've been using a broadcast variable map for a similar kind of thing -- it is probably within 1GB, but I'm interested to know if the lookup (or indexed) might be better. C On Friday, June 5, 2015, Dmitry Goldenberg dgoldenberg...@gmail.com

LDA prediction on new document

2015-05-22 Thread Charles Earl
Dani, Folding in, I believe, refers to setting up your Gibbs sampler (or other model) with the learned word and document topic proportions as computed by Spark. You might look at https://lists.cs.princeton.edu/pipermail/topic-models/2014-May/002763.html Where Jones suggests summing across

Re: Does newly-released LDA (Latent Dirichlet Allocation) algorithm supports ngrams?

2015-03-19 Thread Charles Earl
Heszak, I have only glanced at it, but you should be able to incorporate tokens approximating n-grams yourself, say by using the Lucene ShingleAnalyzerWrapper API http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapper.html You might also take a
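
The shingling idea in the reply above can be illustrated without Lucene. What follows is only a minimal sketch, assuming whitespace-tokenized documents: the function name `shingle_tokens` and its parameters are hypothetical, and it is a plain-Python stand-in for the idea, not the ShingleAnalyzerWrapper API. It emits each word together with the word n-grams that start at it, so that an LDA vocabulary built from the output contains multi-word tokens.

```python
def shingle_tokens(tokens, min_n=1, max_n=2, sep="_"):
    """Return word n-grams ("shingles") of length min_n..max_n, in document order.

    Hypothetical helper for illustration; not part of Lucene or Spark.
    """
    out = []
    for i in range(len(tokens)):
        for n in range(min_n, max_n + 1):
            # Only emit a shingle if it fits inside the token list.
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


doc = "latent dirichlet allocation".split()
print(shingle_tokens(doc))
```

The resulting token lists can then be counted into term-frequency vectors in the usual way before being handed to a topic model.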

Re: Status of MLLib exporting models to PMML

2014-11-18 Thread Charles Earl
Yes, the case is convincing for PMML with Oryx. I will also investigate parameter server. Cheers, Charles On Tuesday, November 18, 2014, Sean Owen so...@cloudera.com wrote: I'm just using PMML. I haven't hit any limitation of its expressiveness, for the model types it supports. I don't think

Re: Status of MLLib exporting models to PMML

2014-11-16 Thread Charles Earl
Manish and others, A follow-up question on my mind is whether there are protobuf (or other binary format) frameworks in the vein of PMML. Perhaps scientific data storage frameworks like NetCDF or ROOT are possible also. I like the comprehensiveness of PMML but as you mention the complexity of

Anything like grid search available for mlbase?

2014-06-20 Thread Charles Earl
Looking for something like scikit's grid search module. C
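
For readers unfamiliar with the term, grid search means exhaustively scoring every combination of candidate hyperparameter values and keeping the best one. The sketch below is a plain-Python illustration of that idea only: `grid_search`, `toy_score`, and the parameter names `k` and `alpha` are hypothetical, and it is neither scikit-learn's API nor anything provided by mlbase. In practice the scoring function would train a model and evaluate it on held-out data.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Score every combination in param_grid and return (best_params, best_score).

    param_grid maps a parameter name to a list of candidate values.
    Higher scores are treated as better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(k, alpha):
    # Hypothetical stand-in for "train with these settings, then evaluate".
    return -((k - 20) ** 2) - (alpha - 0.1) ** 2


best, score = grid_search(toy_score, {"k": [10, 20, 50], "alpha": [0.01, 0.1, 1.0]})
print(best, score)
```

The cost grows with the product of the list lengths, so each added parameter multiplies the number of training runs.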

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Charles Earl
While I can't definitively speak to MLLib online learning, I'm sure you're evaluating Vowpal Wabbit, for which there have been some Storm integrations contributed. Also you might look at factorie, http://factorie.cs.umass.edu, which at least provides an online LDA. C On Thursday, June 19,