Hi Michael,

Check out https://issues.apache.org/jira/browse/MAHOUT-944

This is the lucene2seq tool. You pass it fields and a Lucene query, and it generates text sequence files; from there you can use seq2sparse.

Cheers,
Frank

Sorry for brevity, sent from phone
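(For what it's worth, here is roughly what that workflow boils down to, as a self-contained sketch. This assumes Lucene 3.x and the classic Hadoop SequenceFile.Writer constructor; the index path, the "id" and "body" field names, and the query string are placeholders, not anything from the patch.)

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneQueryToSeq {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")), true);
    IndexSearcher searcher = new IndexSearcher(reader);

    // Restrict the export to documents matching an arbitrary Lucene query.
    // Version constant: match whatever Lucene version you build against.
    Query query = new QueryParser(Version.LUCENE_35, "body",
        new StandardAnalyzer(Version.LUCENE_35)).parse("category:sports");

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path("out/chunk-0"), Text.class, Text.class);
    try {
      // Collect all hits; maxDoc() is an upper bound on the result size.
      for (ScoreDoc sd : searcher.search(query, Math.max(1, reader.maxDoc())).scoreDocs) {
        Document doc = searcher.doc(sd.doc);
        // Key = the id field, value = the content field; both must be stored in the index.
        writer.append(new Text(doc.get("id")), new Text(doc.get("body")));
      }
    } finally {
      writer.close();
      searcher.close();
      reader.close();
    }
  }
}

The resulting <Text, Text> sequence file is the format seq2sparse expects as input.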
On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:

> Hi!
>
> I am trying to extend the "mahout lucene.vector" driver so that it can be
> fed arbitrary key-value constraints on Solr schema fields (and generate
> Mahout vectors for only a subset of the index, which seems to be a common
> use case).
>
> The best (easiest) way I can see is to create an IndexReader implementation
> that exposes only that subset.
>
> The problem is that I don't know the correct way to do this. Subclassing
> FilterIndexReader might solve it, but I don't know which methods to
> override to get a consistent object representation.
>
> The driver code includes the following:
>
>   IndexReader reader = IndexReader.open(dir, true);
>
>   Weight weight;
>   if ("tf".equalsIgnoreCase(weightType)) {
>     weight = new TF();
>   } else if ("tfidf".equalsIgnoreCase(weightType)) {
>     weight = new TFIDF();
>   } else {
>     throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
>   }
>
>   TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
>   VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>
>   LuceneIterable iterable;
>   if (norm == LuceneIterable.NO_NORMALIZING) {
>     iterable = new LuceneIterable(reader, idField, field, mapper,
>         LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>   } else {
>     iterable = new LuceneIterable(reader, idField, field, mapper, norm,
>         maxPercentErrorDocs);
>   }
>
> It then creates a SequenceFile.Writer and writes out the "iterable" variable.
>
> Do you have any thoughts on the simplest way to inject such code?
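P.S. On the FilterIndexReader question itself: one approach is to run the constraint query once up front and then report every non-matching document as deleted, so the rest of the driver skips those documents as if they had been removed from the index. A minimal, untested sketch against the Lucene 3.x API (the class name QueryFilteredIndexReader is made up):

import java.io.IOException;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

/** Hides every document that does not match the given query by reporting it as deleted. */
public class QueryFilteredIndexReader extends FilterIndexReader {

  private final OpenBitSet matching;
  private final int matchCount;

  public QueryFilteredIndexReader(IndexReader in, Query query) throws IOException {
    super(in);
    matching = new OpenBitSet(in.maxDoc());
    IndexSearcher searcher = new IndexSearcher(in);
    try {
      // Run the constraint query once and remember which documents matched.
      searcher.search(query, new Collector() {
        private int docBase;
        @Override public void setScorer(Scorer scorer) {}
        @Override public void collect(int doc) { matching.set(docBase + doc); }
        @Override public void setNextReader(IndexReader reader, int base) { docBase = base; }
        @Override public boolean acceptsDocsOutOfOrder() { return true; }
      });
    } finally {
      searcher.close(); // closes the searcher only, not the wrapped reader
    }
    // A search never returns already-deleted documents, so this is the subset size.
    matchCount = (int) matching.cardinality();
  }

  @Override
  public boolean isDeleted(int n) {
    // Report non-matching documents as deleted so iteration skips them.
    return !matching.get(n) || in.isDeleted(n);
  }

  @Override
  public boolean hasDeletions() {
    return true;
  }

  @Override
  public int numDocs() {
    return matchCount;
  }
}

The driver could then wrap the reader it already opens, e.g. IndexReader reader = new QueryFilteredIndexReader(IndexReader.open(dir, true), query), and the rest of the code stays unchanged. One caveat: this only hides documents; docFreq() and the other term statistics still come from the full index, so the min/max document-frequency pruning in CachedTermInfo would be computed over whole-index counts rather than over the subset.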
