LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();

Configuration configuration = ... ;
IndexDirectory indexDirectory = ... ;
Path seqPath = ... ;
String idField = ... ;
String field = ... ;
List<String> extraFields = asList( ... );
Query query = ... ;

LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
    new LuceneIndexToSequenceFilesConfiguration(configuration,
        indexDirectory.getFile(), seqPath, idField, field);
lucene2SeqConf.setExtraFields(extraFields);
lucene2SeqConf.setQuery(query);

lucene2Seq.run(lucene2SeqConf);
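If you want to fetch all documents rather than a query subset, you can use a
MatchAllDocsQuery (from org.apache.lucene.search), for example:

Query query = new MatchAllDocsQuery(); // matches every document in the index
lucene2SeqConf.setQuery(query);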
The seqPath variable can then be passed into seq2sparse (e.g. as its -i/--input option).

Cheers,

Frank

On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin <[email protected]> wrote:
> Frank, could you please tell me how to use your lucene2seq tool?
>
> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>
>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>> <[email protected]> wrote:
>>> Thank you, Frank! I'll definitely have a look at it.
>>>
>>> As far as I can see, the problem with using Lucene for clustering
>>> tasks is that even with queries you only get access to the
>>> "tip of the iceberg" of the results, while clustering tasks need to
>>> deal with the results as a whole.
>>>
>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>> Hi Michael,
>>>>
>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>
>>>> This is a lucene2seq tool. You can pass in fields and a Lucene query
>>>> and it generates text sequence files.
>>>>
>>>> From there you can use seq2sparse.
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>> Sorry for brevity, sent from phone
>>>>
>>>> On Jan 17, 2012, at 17:37, Michael Kazekin
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I am trying to extend the "mahout lucene.vector" driver so that it
>>>>> can be fed arbitrary key-value constraints on Solr schema fields
>>>>> (and generate Mahout vectors for only a subset of the index, which
>>>>> seems to be a common use case).
>>>>>
>>>>> The best (easiest) way I see is to create an IndexReader
>>>>> implementation that exposes only that subset.
>>>>>
>>>>> The problem is that I don't know the correct way to do this.
>>>>>
>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>> don't know which methods to override to get a consistent object
>>>>> representation.
>>>>>
>>>>> The driver code includes the following:
>>>>>
>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>
>>>>> Weight weight;
>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>   weight = new TF();
>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>   weight = new TFIDF();
>>>>> } else {
>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>       + " is not supported");
>>>>> }
>>>>>
>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>     maxDFPercent);
>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>
>>>>> LuceneIterable iterable;
>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>> } else {
>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>       norm, maxPercentErrorDocs);
>>>>> }
>>>>>
>>>>> It then creates a SequenceFile.Writer and writes out the iterable.
>>>>>
>>>>> Do you have any thoughts on how to inject the code in the simplest
>>>>> way?
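>>>>>
>>>>> For concreteness, here is the kind of (completely untested) sketch
>>>>> I have in mind, assuming the subset has already been materialized
>>>>> as an OpenBitSet of the document ids to keep -- I am not sure these
>>>>> overrides are enough for a fully consistent view:
>>>>>
>>>>> import java.io.IOException;
>>>>>
>>>>> import org.apache.lucene.index.FilterIndexReader;
>>>>> import org.apache.lucene.index.IndexReader;
>>>>> import org.apache.lucene.index.Term;
>>>>> import org.apache.lucene.index.TermDocs;
>>>>> import org.apache.lucene.util.OpenBitSet;
>>>>>
>>>>> /** Exposes only the documents whose ids are set in 'keep'. */
>>>>> public class SubsetIndexReader extends FilterIndexReader {
>>>>>
>>>>>   private final OpenBitSet keep;
>>>>>
>>>>>   public SubsetIndexReader(IndexReader in, OpenBitSet keep) {
>>>>>     super(in);
>>>>>     this.keep = keep;
>>>>>   }
>>>>>
>>>>>   // Report everything outside the subset as deleted.
>>>>>   @Override
>>>>>   public boolean isDeleted(int n) {
>>>>>     return !keep.get(n) || in.isDeleted(n);
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public boolean hasDeletions() {
>>>>>     return true;
>>>>>   }
>>>>>
>>>>>   // Assumes 'keep' only marks documents that are live in 'in'.
>>>>>   @Override
>>>>>   public int numDocs() {
>>>>>     return (int) keep.cardinality();
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public TermDocs termDocs() throws IOException {
>>>>>     return filtered(in.termDocs());
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public TermDocs termDocs(Term term) throws IOException {
>>>>>     return filtered(in.termDocs(term));
>>>>>   }
>>>>>
>>>>>   // Skips postings of documents outside the subset. read(int[],
>>>>>   // int[]), skipTo(int) and termPositions() would presumably need
>>>>>   // the same treatment.
>>>>>   private TermDocs filtered(TermDocs termDocs) {
>>>>>     return new FilterTermDocs(termDocs) {
>>>>>       @Override
>>>>>       public boolean next() throws IOException {
>>>>>         while (in.next()) {
>>>>>           if (keep.get(in.doc())) {
>>>>>             return true;
>>>>>           }
>>>>>         }
>>>>>         return false;
>>>>>       }
>>>>>     };
>>>>>   }
>>>>> }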
