Are you using Lucene 3.4? I had this problem as well and I believe this was because of https://issues.apache.org/jira/browse/LUCENE-3442 which is fixed in Lucene 3.5.
On Wed, Jan 25, 2012 at 1:42 PM, Michael Kazekin <[email protected]> wrote:
> Frank, I tried to use a BooleanQuery consisting of several TermQueries
> (these represent key:value constraints, where key is the field name, for
> example "lang:en"), but the Scorer created by the Weight in your code is
> null. Do you know what could be wrong here?
>
> Sorry to bother you on the dev list with such questions, but I am trying
> to make a CLI util for this code, so I think it would be helpful for
> everybody.

Great! Let me know if you need more help.

Cheers,

Frank

>
> On 01/20/2012 02:15 AM, Frank Scholten wrote:
>>
>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>
>> Configuration configuration = ... ;
>> IndexDirectory indexDirectory = ... ;
>> Path seqPath = ... ;
>> String idField = ... ;
>> String field = ... ;
>> List<String> extraFields = asList( ... );
>> Query query = ... ;
>>
>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf = new
>> LuceneIndexToSequenceFilesConfiguration(configuration,
>> indexDirectory.getFile(), seqPath, idField, field);
>> lucene2SeqConf.setExtraFields(extraFields);
>> lucene2SeqConf.setQuery(query);
>>
>> lucene2Seq.run(lucene2SeqConf);
>>
>> The seqPath variable can be passed into seq2sparse.
>>
>> Cheers,
>>
>> Frank
>>
>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
>> <[email protected]> wrote:
>>>
>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>
>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>>
>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>
>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>>
>>>>> As far as I can see, the problem with using Lucene for clustering
>>>>> tasks is that even with queries you get access only to the
>>>>> "tip-of-the-iceberg" results, while clustering tasks need to deal
>>>>> with the results as a whole.
>>>>>
>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>
>>>>>> This is a lucene2seq tool. You can pass in fields and a Lucene
>>>>>> query and it generates text sequence files.
>>>>>>
>>>>>> From there you can use seq2sparse.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Frank
>>>>>>
>>>>>> Sorry for brevity, sent from phone
>>>>>>
>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that it
>>>>>>> can be fed arbitrary key:value constraints on Solr schema fields
>>>>>>> (and generate Mahout vectors for only a subset, which seems to be
>>>>>>> a regular use case).
>>>>>>>
>>>>>>> The best (easiest) way I see is to create an IndexReader
>>>>>>> implementation that would allow reading the subset.
>>>>>>>
>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>
>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>>>> don't know which methods to override to get a consistent object
>>>>>>> representation.
>>>>>>>
>>>>>>> The driver code includes the following:
>>>>>>>
>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>
>>>>>>> Weight weight;
>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TF();
>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TFIDF();
>>>>>>> } else {
>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>>>       + " is not supported");
>>>>>>> }
>>>>>>>
>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>>>     maxDFPercent);
>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>
>>>>>>> LuceneIterable iterable;
>>>>>>>
>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>> } else {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>       norm, maxPercentErrorDocs);
>>>>>>> }
>>>>>>>
>>>>>>> It then creates a SequenceFile.Writer and writes out the
>>>>>>> "iterable" variable.
>>>>>>>
>>>>>>> Do you have any thoughts on how to inject the code in the simplest
>>>>>>> way?
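For reference, the query Michael describes upthread (a BooleanQuery of TermQuery clauses, one per key:value constraint such as "lang:en") could be built along these lines. This is a minimal sketch against the Lucene 3.x API this thread assumes; the second field/value pair is a made-up example, not something from the thread:

```java
// Sketch: combining key:value constraints into a BooleanQuery
// (Lucene 3.x API assumed). The resulting query could be passed to
// lucene2SeqConf.setQuery(...). "type":"article" is a hypothetical
// example constraint.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ConstraintQuerySketch {
    public static void main(String[] args) {
        BooleanQuery query = new BooleanQuery();
        // Each key:value constraint becomes a required (MUST) clause.
        query.add(new TermQuery(new Term("lang", "en")), Occur.MUST);
        query.add(new TermQuery(new Term("type", "article")), Occur.MUST);
        System.out.println(query); // e.g. +lang:en +type:article
    }
}
```

One caveat worth checking against the null-Scorer symptom: TermQuery matches indexed terms exactly, so if the field was analyzed at index time the constraint value may need to be the analyzed form of the term, and a Scorer of null from a Weight typically just means no documents match that clause.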
