Re: MLlib LDA implementation questions

2015-09-11 Thread Carsten Schnober
Hi,
I don't have practical experience with the MLlib LDA implementation, but
regarding the variations in the topic matrix: LDA makes use of stochastic
processes. If you call setSeed(seed) with the same seed value during
initialization, though, your results should be identical.
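For example, a minimal sketch (corpus stands for your RDD[(Long,
Vector)] of document term counts; the k and seed values are
placeholders):

import org.apache.spark.mllib.clustering.LDA

// Fixing the seed should make repeated runs on the same input
// reproducible; 10 and 42L are placeholder values.
val ldaModel = new LDA()
  .setK(10)
  .setSeed(42L)
  .run(corpus)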

May I ask what exactly you mean by prediction? Topic assignments
(inference)?

Best,
Carsten


Am 11.09.2015 um 15:29 schrieb Marko Asplund:
> Hi,
> 
> We're considering using the Spark MLlib (v >= 1.5) LDA implementation
> for topic modelling. We plan to train the model on a data set of about
> 12 M documents with a vocabulary of 200-300 k items. Documents are
> relatively short, typically containing fewer than 10 words, though the
> count can range up to tens of words. The model would be updated
> periodically, e.g. by a batch process, while predictions would be
> queried by a long-running application process in which we plan to embed
> MLlib.
> 
> Is the MLlib LDA implementation considered to be well-suited to this
> kind of use case?
> 
> I did some prototyping based on the code samples on the "MLlib -
> Clustering" page and noticed that the topics-matrix values seem to vary
> quite a bit across training runs, even with the exact same input data
> set. I observed similar behaviour during prediction.
> Is this due to the probabilistic nature of the LDA algorithm?
> 
> Any caveats to be aware of with the LDA implementation?
> 
> For reference, my prototype code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala
> 
> 
> thanks,
> marko

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de




K Nearest Neighbours

2015-07-10 Thread Carsten Schnober
Hi,
I have the following problem, which is a kind of special case of k
nearest neighbours.
I have an Array of Vectors (v1) and an RDD[(Long, Vector)] of
index-vector pairs (v2). The array v1 easily fits into a single node's
memory (~100 entries), but v2 is very large (millions of entries).

My goal is to find, for each vector in v1, the entries in v2 with the
least distance. The naive solution would be a helper function that
computes the distances between one vector from v1 and all vectors in
v2, sorts them, and returns the top n results:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def computeDistances(vector: Vector, vectors: RDD[(Long, Vector)],
    n: Int = 10): Seq[Long] = {
  vectors.map { case (id, v) => (id, Vectors.sqdist(v, vector)) }
    .sortBy(_._2) // sort by distance, ascending
    .map(_._1)    // retain indexes only
    .take(n)      // n nearest
}

So I can map the entries in v1 (zipped with their indexes to keep track
of the mappings) to the distances:

v1.zipWithIndex.map { v => (computeDistances(v._1, v2), v._2) }

This gives me for each entry in v1 the indexes of the n closest entries
in v2.
However, as v1 is an array, the computeDistances() calls are all
issued sequentially (from the driver, if I understand correctly)
rather than being distributed.

The problem is that I must not convert v1 into an RDD, because that
would result in an error due to nested RDD actions in computeDistances().

To conclude, what I would like to do (if it were possible) is this:

val v1: Seq[Vector] = ...
val v2: RDD[(Long, Vector)] = ...
sc.parallelize(v1).zipWithIndex
  .map { v => (computeDistances(v._1, v2), v._2) }


Is there any good practice to approach problems like this?
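For reference, one pattern I can imagine (a sketch only, untested; sc
is the SparkContext from above) is to invert the loop: broadcast the
small v1 and collect the n nearest entries for every v1 index in a
single distributed pass over v2 with aggregate:

// For each v2 entry, update a per-v1-index list of the n closest
// (distance, id) pairs; a bounded priority queue would be faster
// than re-sorting, but this keeps the sketch short.
def nearestNeighbours(v1: Seq[Vector], v2: RDD[(Long, Vector)],
    n: Int = 10): Array[List[Long]] = {
  val bcV1 = sc.broadcast(v1)
  val zero = Array.fill(v1.size)(List.empty[(Double, Long)])
  def insert(acc: Array[List[(Double, Long)]], e: (Long, Vector)) = {
    for (i <- acc.indices) {
      val d = Vectors.sqdist(bcV1.value(i), e._2)
      acc(i) = ((d, e._1) :: acc(i)).sortBy(_._1).take(n)
    }
    acc
  }
  def merge(a: Array[List[(Double, Long)]],
      b: Array[List[(Double, Long)]]) = {
    for (i <- a.indices) a(i) = (a(i) ++ b(i)).sortBy(_._1).take(n)
    a
  }
  v2.aggregate(zero)(insert, merge).map(_.map(_._2))
}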
Thanks!
Carsten







Word2Vec distributed?

2015-07-10 Thread Carsten Schnober
Hi,
I've been experimenting with the Spark Word2Vec implementation in the
MLlib package.
It seems to me that only the preparatory steps (stages 0-2, which
prepare the data) are actually performed in a distributed way. In
stage 3 (mapPartitionsWithIndex at Word2Vec.scala:312), only one node
seems to be working, using a single CPU.

I suppose this is related to the discussion in [1], essentially stating
that the original algorithm allows for multi-threading, but not for
distributed computation due to frequent internal communication.

To my understanding, this issue has not been fully resolved in Spark,
has it? I just wonder whether I am interpreting the current situation
correctly.
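That said, Word2Vec does expose a setNumPartitions parameter which, as
far as I understand, splits the training stage itself across
partitions, trading some accuracy for parallelism. A minimal sketch
(8 is a placeholder value; input is an RDD[Seq[String]]):

import org.apache.spark.mllib.feature.Word2Vec

// numPartitions > 1 parallelizes training across executors; per the
// JIRA discussion, accuracy may suffer because partial models are
// trained independently before being merged.
val word2vec = new Word2Vec().setNumPartitions(8)
val model = word2vec.fit(input)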

Thanks!
Carsten

[1] https://issues.apache.org/jira/browse/SPARK-2510







Re: word2vec: how to save an mllib model and reload it?

2015-02-05 Thread Carsten Schnober
As a Spark newbie, I've come across this thread. I'm playing with
Word2Vec on our Hadoop cluster, and here's my issue with classic Java
serialization of the model: I don't have SSH access to the cluster
master node.
Here's my code for computing the model:

import java.io.{FileOutputStream, ObjectOutputStream}
import org.apache.spark.mllib.feature.Word2Vec

val input = sc.textFile("README.md").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
// Plain Java serialization; this writes to the driver's local
// file system.
val oos = new ObjectOutputStream(new FileOutputStream(modelFile))
oos.writeObject(model)
oos.close()

I can do that locally and get the file as desired. But that is of
little use to me if the file is stored on the master.

I've alternatively serialized the vectors to HDFS using this code:

val vectors = model.getVectors              // Map[String, Array[Float]]
val output = sc.parallelize(vectors.toSeq)
output.saveAsObjectFile(modelFile)

Indeed, this writes the serialized vectors to HDFS, where I can
access them as a user. However, I have not figured out how to create
a new Word2VecModel object from those files.
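The closest I can see is something like the following sketch. It
assumes Spark 1.5+, where Word2VecModel appears to gain a public
constructor taking a Map[String, Array[Float]]; on 1.4+ the supported
route would be model.save(sc, path) and Word2VecModel.load(sc, path).

import org.apache.spark.mllib.feature.Word2VecModel

// Read the (word, vector) pairs back and rebuild the model.
// Collecting to the driver is fine here, since the model fit on a
// single machine to begin with.
val entries = sc.objectFile[(String, Array[Float])](modelFile)
val restored = new Word2VecModel(entries.collect().toMap)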

Any clues?
Thanks!
Carsten



MLnick wrote:
> Currently I see the word2vec model is collected onto the master, so the
> model itself is not distributed.
>
> I guess the question is why do you need a distributed model? Is the
> vocab size so large that it's necessary? For model serving in general,
> unless the model is truly massive (i.e. cannot fit into memory on a
> modern high-end box with 64 or 128 GB RAM), then a single instance is
> way faster and simpler (using a cluster of machines is more for load
> balancing / fault tolerance).
>
> What is your use case for model serving?
>
> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@> wrote:
>
>> you're right, serialization works.
>> what is your suggestion on saving a "distributed" model? so part of
>> the model is in one cluster, and some other parts of the model are in
>> other clusters. during runtime, these sub-models run independently in
>> their own clusters (load, train, save). and at some point during run
>> time these sub-models merge into the master model, which also loads,
>> trains, and saves at the master level.
>> much appreciated.
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with Matlab. In Python, either pickling things
>>> or dumping to JSON seems pretty common (even the scikit-learn docs
>>> recommend pickling -
>>> http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent to Java serialization to me.
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@> wrote:
>>>
>>>> that works. is there a better way in spark? this seems like the
>>>> most common feature for any machine learning work - to be able to
>>>> save your model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@> wrote:
>>>>
>>>>> Plain old Java serialization is one straightforward approach if
>>>>> you're in Java/Scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@> wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained
>>>>>> and reload it in the future? specifically, i'm using the mllib
>>>>>> word2vec model... thanks.



