Well, I found a simple (maybe dirty) solution for my problem.

I'm writing it here for the record, since I understand these emails get
archived and are accessible through the web.

---

The solution is to merge the indices of the SolrCores that make up the
SolrCloud system into one index, and then create the vector as usual
from that big merged index.

This solution may be suboptimal: if you have several really big
indices that you just can't merge, you are out of luck. But if you can
afford to merge the needed indices just to create the Mahout vector, then
this can work for you.


The steps are as follows; I tested them using Solr 4.2.1.

You need two files: "lucene-core-VERSION.jar" and "lucene-misc-VERSION.jar"
(where "VERSION" is your Lucene/Solr version; for example,
"lucene-core-4.2.1.jar"). Both files are in the Solr directory, under
"./example/solr-webapp/webapp/WEB-INF/lib/".

Change to that directory:

cd example/solr-webapp/webapp/WEB-INF/lib/

Then execute the following:

java -cp lucene-core-VERSION.jar:lucene-misc-VERSION.jar \
  org/apache/lucene/misc/IndexMergeTool \
  /path/to/newindex \
  /path/to/index1 \
  /path/to/index2


Replace "VERSION" with your Lucene/Solr version and the "/path/to/..."
placeholders with the real paths to your indices.
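
For anyone who wants to script the merge, here is a minimal sketch that wraps the command above in a shell function. The names LUCENE_VERSION, LIB_DIR, DRY_RUN and merge_indices are my own illustrative assumptions, not part of Solr or Lucene; only the IndexMergeTool class and the two jar names come from the steps above. It uses ":" as the classpath separator, so on Windows you would need ";" instead.

```shell
#!/bin/sh
# Sketch only: the variable and function names here are hypothetical.
LUCENE_VERSION="${LUCENE_VERSION:-4.2.1}"
LIB_DIR="${LIB_DIR:-example/solr-webapp/webapp/WEB-INF/lib}"

# Usage: merge_indices NEW_INDEX INDEX1 INDEX2 [INDEX3 ...]
# With DRY_RUN=1 the java command is printed instead of executed.
merge_indices() {
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_indices NEW_INDEX INDEX1 INDEX2 [...]" >&2
        return 1
    fi
    classpath="$LIB_DIR/lucene-core-$LUCENE_VERSION.jar:$LIB_DIR/lucene-misc-$LUCENE_VERSION.jar"
    # Rebuild the argument list as the full java command line.
    set -- java -cp "$classpath" org.apache.lucene.misc.IndexMergeTool "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 and calling "merge_indices /path/to/newindex /path/to/index1 /path/to/index2" only prints the command, which is a cheap way to check the jar paths before touching any index.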

If you are using the Solr instance in the "example" directory, the index
path would be "./example/solr/collection1/data/index/".

Your newly merged index will be in "/path/to/newindex", and from there
you can create your Mahout vector.
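
As a rough sketch of that last step, the function below runs Mahout's lucene.vector tool on the merged index. The function name make_vectors, the DRY_RUN switch, and the output file names "vectors" and "dictionary.txt" are hypothetical choices of mine; the field you pass must be one that exists in your own index, and as far as I know it needs term vectors stored for lucene.vector to work.

```shell
#!/bin/sh
# Sketch only: paths, output names and the function name are hypothetical.
MAHOUT="${MAHOUT_HOME:-.}/bin/mahout"

# Usage: make_vectors INDEX_DIR FIELD OUT_DIR
# With DRY_RUN=1 the mahout command is printed instead of executed.
make_vectors() {
    if [ "$#" -ne 3 ]; then
        echo "usage: make_vectors INDEX_DIR FIELD OUT_DIR" >&2
        return 1
    fi
    # Rebuild the argument list as the full lucene.vector command line.
    set -- "$MAHOUT" lucene.vector \
        --dir "$1" \
        --field "$2" \
        --output "$3/vectors" \
        --dictOut "$3/dictionary.txt"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

For example, "make_vectors /path/to/newindex text /path/to/output" would vectorize a field called "text" (an assumed name) and write the vector file and the dictionary under the output directory, ready to feed to kmeans.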

I made my solution based on this article:
http://docs.lucidworks.com/display/solr/Merging+Indexes


I hope this helps somebody someday too.

Best Regards,

Sebastián Ramírez



On Mon, Apr 22, 2013 at 9:35 PM, Sebastian Ramirez <
sebastian.rami...@senseta.com> wrote:

> Hello everyone,
>
> I want to know if it's possible to cluster documents in
> SolrCloud indices (multiple "index" directories) and how one would
> accomplish that.
>
> ---
>
> I'm using Solr 4.2.1 and Mahout 0.8-SNAPSHOT
>
> I can cluster documents from one Lucene/Solr index. I can even cluster
> documents from a Solr 4.x index (the same version that implements the
> distributed SolrCloud).
>
> As far as I know, SolrCloud uses indices distributed in files across
> "shards" as one big index.
>
> The problem is that although I can cluster documents from one index, from
> one "shard"/"SolrCore", I can't cluster the documents from the whole index.
> ...Or at least, I don't know how to do it.
>
> I used Mahout's lucene.vector tool, which takes one index directory and
> outputs a "vector" file (if I'm not wrong) and a text "dictionary". Then I
> can use Mahout with, for example, kmeans to cluster the "documents".
>
> The problem is that I can only pass one index directory as an argument to
> lucene.vector, and if I had two "SolrCores"/"shards" I would have two
> "index" directories.
>
> I can even cluster the data that happened to be in one of those "index"
> directories, but not all the data in both (the complete index).
> I tried to pass the two directories to lucene.vector, I also tried to
> create both vectors and pass the directory in which they were to kmeans
> instead of passing the vector file directly ...but I always got an error.
>
> I don't know if it's possible to "merge" two vectors, or extract in some
> way a vector from the whole distributed index or "export" the indices in
> some format that can then be converted to a format Mahout supports...
> whatever that can be done may help... Is there anything that can be done?
>
>
>
>
> I'm really a newbie with Mahout and Solr, and I know that some of the
> things I wrote may sound silly, as a newbie's questions often do...
>
> So, many thanks for your patience and help! :)
>
>
> Sebastián Ramírez
>
>
