Hello everyone,

I'd like to know whether it's possible to cluster the documents in
SolrCloud indices (multiple "index" directories) and, if so, how one
would accomplish that.

---

I'm using Solr 4.2.1 and Mahout 0.8-SNAPSHOT.

I can cluster documents from one Lucene/Solr index. I can even cluster
documents from a Solr 4.x index (the same version that implements the
distributed SolrCloud).

As far as I know, SolrCloud treats the index files distributed across
"shards" as one big index.

The problem is that although I can cluster documents from one index, from
one "shard"/"SolrCore", I can't cluster the documents from the whole index.
...Or at least, I don't know how to do it.

I used Mahout's lucene.vector tool, which takes one index directory and
outputs a "vector" file (if I'm not mistaken) and a text "dictionary". Then
I can use Mahout with, for example, kmeans to cluster the "documents".

The problem is that I can only pass one index directory as an argument to
lucene.vector, and if I had two "SolrCores"/"shards" I would have two
"index" directories.

I can cluster the data that happened to be in one of those "index"
directories, but not all the data in both (the complete index).
I tried passing the two directories to lucene.vector. I also tried
creating both vector files and passing the directory that contained them to
kmeans instead of passing a vector file directly, but I always got an error.

I don't know if it's possible to "merge" two vectors, to extract a vector
in some way from the whole distributed index, or to "export" the indices in
some format that can then be converted to a format Mahout supports.
Anything along those lines would help. Is there anything that can be done?
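
To make the "merge" idea concrete, here is a minimal Python sketch of what I
imagine. It is not something Mahout provides as far as I know, and it rests on
my assumptions: that each "dictionary" is a term-to-position mapping built
separately for each index, and that each document vector is a sparse
position-to-weight mapping. The function names and the sample data are made up
for illustration.

```python
def merge_dictionaries(dict_a, dict_b):
    """Build one shared term -> position mapping from two per-shard dictionaries."""
    merged = {}
    for term in sorted(set(dict_a) | set(dict_b)):
        merged[term] = len(merged)
    return merged


def remap_vector(vector, shard_dict, merged_dict):
    """Re-express a sparse vector {position: weight} in the merged term space."""
    position_to_term = {pos: term for term, pos in shard_dict.items()}
    return {merged_dict[position_to_term[pos]]: w for pos, w in vector.items()}


# Hypothetical dictionaries and vectors from two shards.
dict_a = {"cloud": 0, "solr": 1}
dict_b = {"mahout": 0, "solr": 1, "vector": 2}
vec_a = {0: 1.0, 1: 2.0}  # cloud=1.0, solr=2.0 in shard A's numbering
vec_b = {0: 3.0, 2: 1.5}  # mahout=3.0, vector=1.5 in shard B's numbering

merged = merge_dictionaries(dict_a, dict_b)
all_vectors = [
    remap_vector(vec_a, dict_a, merged),
    remap_vector(vec_b, dict_b, merged),
]
```

If my assumption is right, position 0 means "cloud" in one shard and "mahout"
in the other, which may be why simply putting both vector files together does
not work. I also suspect that weights computed per shard would not be directly
comparable, and this sketch does nothing about that.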




I'm really a newbie with Mahout and Solr, and I know that some of the
things I wrote may sound as silly as newbie questions often do...

So, many thanks for your patience and help! :)


Sebastián Ramírez
