Hello everyone, I want to know whether it's possible to cluster documents across SolrCloud indices (multiple "index" directories), and how one would accomplish that.
---

I'm using Solr 4.2.1 and Mahout 0.8-SNAPSHOT.

I can cluster documents from one Lucene/Solr index. I can even cluster documents from a Solr 4.x index (the same version that implements the distributed SolrCloud). As far as I know, SolrCloud keeps its index in files distributed across "shards" and treats them as one big index.

The problem is that although I can cluster documents from one index, that is, from one "shard"/"SolrCore", I can't cluster the documents from the whole index. Or at least, I don't know how to do it.

I used Mahout with the lucene.vector tool. It takes one index directory and outputs a "vector" file (if I'm not wrong) and a text "dictionary". Then I can use Mahout with, for example, kmeans to cluster the "documents".

The problem is that I can only pass one index directory as an argument to lucene.vector, and if I had two "SolrCores"/"shards" I would have two "index" directories. I can cluster the data that happens to be in one of those "index" directories, but not all the data in both (the complete index).

I tried passing the two directories to lucene.vector. I also tried creating both vector files and passing kmeans the directory that contains them instead of passing a vector file directly. In both cases I got an error.

I don't know if it's possible to "merge" two vectors, to extract a vector from the whole distributed index in some way, or to "export" the indices in some format that can then be converted to a format Mahout supports. Anything that can be done may help.

Is there anything that can be done?

I'm really a newbie with Mahout and Solr, and I know that some of the things I wrote may sound silly, as a newbie's questions often do. So, many thanks for your patience and help! :)

Sebastián Ramírez
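
For reference, the single-index workflow described above (lucene.vector on one "index" directory, then kmeans on the resulting vectors) can be sketched roughly as below. This is only a minimal sketch, not the exact commands that were run: the paths, the field name `text`, the value of `k`, and the helper names are all hypothetical, and the flag names follow Mahout's documented `lucene.vector` and `kmeans` usage, so they should be checked against the installed version. The script only builds and prints the command lines; it does not invoke Mahout.

```python
import shlex


def lucene_vector_cmd(index_dir, field, vectors_out, dict_out):
    """Build the argv for Mahout's lucene.vector driver.

    Note that --dir receives exactly one index directory, which is the
    limitation described in this post.
    """
    return [
        "mahout", "lucene.vector",
        "--dir", index_dir,        # a single Lucene/Solr "index" directory
        "--field", field,          # assumed field name holding the text
        "--output", vectors_out,   # vector file to be written
        "--dictOut", dict_out,     # text dictionary to be written
    ]


def kmeans_cmd(vectors, initial_clusters, output_dir, k, max_iter=10):
    """Build the argv for Mahout's kmeans driver."""
    return [
        "mahout", "kmeans",
        "--input", vectors,
        "--clusters", initial_clusters,  # where initial centroids are stored
        "--output", output_dir,
        "--numClusters", str(k),
        "--maxIter", str(max_iter),
        "--clustering",                  # also assign documents to clusters
    ]


def single_index_pipeline(index_dir, work_dir, field="text", k=20):
    """Return the two commands for one shard's index, in execution order."""
    vectors = work_dir + "/vectors"
    dictionary = work_dir + "/dict.txt"
    return [
        lucene_vector_cmd(index_dir, field, vectors, dictionary),
        kmeans_cmd(vectors, work_dir + "/initial-clusters",
                   work_dir + "/kmeans-out", k),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in single_index_pipeline("/path/to/shard1/data/index",
                                     "/tmp/mahout-work"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

With two shards, this pipeline would have to be run once per "index" directory, which yields two separate vector files and two separate dictionaries; that is the situation the question is about.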