Hi Michael,

Check out https://issues.apache.org/jira/browse/MAHOUT-944

It adds a lucene2seq tool: you pass in fields and a Lucene query, and it 
generates text sequence files for the matching documents.

From there you can use seq2sparse.
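
From the command line the invocation would look roughly like this (I am
writing the flag names from memory of the patch, so double-check them
against the tool's --help output; the paths, field names and query are
just placeholders):

  mahout lucene2seq \
    --dir /path/to/lucene/index \
    --output /tmp/seqfiles \
    --idField id \
    --fields title,body \
    --query "category:foo"

  mahout seq2sparse \
    -i /tmp/seqfiles \
    -o /tmp/vectors \
    -wt tfidf

The query is what gives you the subset selection you are after, so you
may not need to patch lucene.vector at all.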

Cheers,

Frank

Sorry for the brevity; sent from my phone

On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:

> Hi!
> 
> I am trying to extend the "mahout lucene.vector" driver so that it can be
> fed arbitrary key-value constraints on Solr schema fields (and generate
> Mahout vectors for only the matching subset of documents, which seems
> like a common use case).
> 
> The best (i.e. easiest) way I can see is to create an IndexReader
> implementation that exposes only that subset.
> 
> The problem is that I don't know the correct way to do this.
> 
> Maybe subclassing FilterIndexReader would solve the problem, but I don't
> know which methods to override to keep the reader's view of the index
> consistent.
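> 
> For illustration, this is roughly what I have in mind (completely
> untested, and I am not sure it is actually consistent: term and document
> frequency statistics would still come from the whole index, and low-level
> TermDocs/TermEnum iteration on the wrapped reader would bypass the
> overridden isDeleted()):
> 
> import org.apache.lucene.index.FilterIndexReader;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.util.OpenBitSet;
> 
> /** Exposes only the documents whose bit is set; the rest look deleted. */
> public class SubsetIndexReader extends FilterIndexReader {
> 
>   private final OpenBitSet keep; // doc ids that satisfy the constraints
> 
>   public SubsetIndexReader(IndexReader in, OpenBitSet keep) {
>     super(in);
>     this.keep = keep;
>   }
> 
>   @Override
>   public boolean isDeleted(int docId) {
>     // Treat every document outside the subset as deleted.
>     return !keep.get(docId) || in.isDeleted(docId);
>   }
> 
>   @Override
>   public int numDocs() {
>     return (int) keep.cardinality();
>   }
> 
>   @Override
>   public boolean hasDeletions() {
>     return true; // force callers to consult isDeleted()
>   }
> }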
> 
> The driver code includes the following:
> 
> IndexReader reader = IndexReader.open(dir, true);
> 
> Weight weight;
> if ("tf".equalsIgnoreCase(weightType)) {
>   weight = new TF();
> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>   weight = new TFIDF();
> } else {
>   throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
> }
> 
> TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
> 
> LuceneIterable iterable;
> if (norm == LuceneIterable.NO_NORMALIZING) {
>   iterable = new LuceneIterable(reader, idField, field, mapper,
>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
> } else {
>   iterable = new LuceneIterable(reader, idField, field, mapper, norm,
>       maxPercentErrorDocs);
> }
> 
> The driver then creates a SequenceFile.Writer and writes the vectors from
> "iterable" out to it.
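> 
> My first idea for the injection point was to wrap the reader right after
> it is opened, before it is handed to CachedTermInfo and TFDFMapper,
> roughly like this (untested; collectMatchingDocs() is a made-up helper
> that would run the key-value constraints as a Lucene query and collect
> the matching doc ids into a bit set):
> 
> IndexReader baseReader = IndexReader.open(dir, true);
> OpenBitSet matching = collectMatchingDocs(baseReader, constraints);
> IndexReader reader = new SubsetIndexReader(baseReader, matching);
> // ... everything below (TermInfo, mapper, iterable) stays unchanged ...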
> 
> 
> Do you have any thoughts on the simplest way to inject this?
> 
