Hi Michael,

Can you compare what you are doing with the code from the testRun_query() unit test
in LuceneIndexToSequenceFilesTest? The unit test works, so I am curious where the
difference is.
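One thing worth double checking: in Lucene 3.x, Weight.scorer() is a per-segment
call (IndexSearcher invokes it once per sub-reader), and returning null is a
Weight's legitimate way of saying "no documents match in this reader", so callers
have to handle it. I am not certain that is the cause in your case, but if all you
need is the set of matching documents, a sketch along these lines uses the
supported search API instead of building the Scorer by hand (assuming Lucene 3.5,
and reusing the searcher and atomQuery from your snippet):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

final List<Integer> hits = new ArrayList<Integer>();
searcher.search(atomQuery, new Collector() {
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    // Scores are not needed; we only collect matching doc ids.
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase; // the search runs segment by segment
  }

  @Override
  public void collect(int doc) throws IOException {
    hits.add(docBase + doc); // rebase the per-segment id to the top-level reader
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // order does not matter when only gathering ids
  }
});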
Cheers,

Frank

On Fri, Jan 27, 2012 at 4:41 PM, Michael Kazekin <[email protected]> wrote:
> Frank, I tried this code with a Solr 3.5 index (and changed all the
> dependencies in the pom file), but it still doesn't work:
>
> Directory directory = FSDirectory.open(file);
> IndexReader reader = IndexReader.open(directory, true);
> IndexSearcher searcher = new IndexSearcher(reader);
>
> I try to get a Scorer with this TermQuery (the "lang" field is indexed
> and stored, and all the data is available):
>
> TermQuery atomQuery = new TermQuery(new Term("lang", "ru"));
>
> Weight weight = atomQuery.createWeight(searcher);
> Scorer scorer = weight.scorer(reader, true, false);
>
> // scorer == null here
>
> On 01/25/2012 07:04 PM, Frank Scholten wrote:
>>
>> Are you using Lucene 3.4? I had this problem as well, and I believe it
>> was because of https://issues.apache.org/jira/browse/LUCENE-3442,
>> which is fixed in Lucene 3.5.
>>
>> On Wed, Jan 25, 2012 at 1:42 PM, Michael Kazekin
>> <[email protected]> wrote:
>>>
>>> Frank, I tried to use a BooleanQuery composed of several TermQueries
>>> (these represent key:value constraints, where the key is the field
>>> name, for example "lang:en"), but the Scorer created by the Weight in
>>> your code is null. Do you know what could be wrong here?
>>>
>>> Sorry to bother you on the dev list with such questions, but I am
>>> trying to build a CLI util for this code, so I think it would be
>>> helpful for everybody.
>>
>> Great! Let me know if you need more help.
>>
>> Cheers,
>>
>> Frank
>>
>>> On 01/20/2012 02:15 AM, Frank Scholten wrote:
>>>>
>>>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>>>
>>>> Configuration configuration = ... ;
>>>> IndexDirectory indexDirectory = ... ;
>>>> Path seqPath = ... ;
>>>> String idField = ... ;
>>>> String field = ... ;
>>>> List<String> extraFields = asList( ... );
>>>> Query query = ... ;
>>>>
>>>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf = new
>>>>     LuceneIndexToSequenceFilesConfiguration(configuration,
>>>>         indexDirectory.getFile(), seqPath, idField, field);
>>>> lucene2SeqConf.setExtraFields(extraFields);
>>>> lucene2SeqConf.setQuery(query);
>>>>
>>>> lucene2Seq.run(lucene2SeqConf);
>>>>
>>>> The seqPath variable can be passed into seq2sparse.
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>>>
>>>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>>>>
>>>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>>>
>>>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>>>>
>>>>>>> As far as I can see, the problem with using Lucene for clustering
>>>>>>> tasks is that even with queries you only get access to the
>>>>>>> "tip of the iceberg" of the results, while clustering tasks need
>>>>>>> to deal with the results as a whole.
>>>>>>>
>>>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>>>
>>>>>>>> This is a lucene2seq tool. You can pass in fields and a Lucene
>>>>>>>> query, and it generates text sequence files.
>>>>>>>>
>>>>>>>> From there you can use seq2sparse.
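>>>>>>>>
>>>>>>>> For example, something along these lines (the paths are
>>>>>>>> placeholders, and the exact flags depend on your Mahout version):
>>>>>>>>
>>>>>>>> $ bin/mahout seq2sparse -i /path/to/sequence-files -o /path/to/vectors -wt tfidf -ow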
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> Sorry for brevity, sent from phone
>>>>>>>>
>>>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that
>>>>>>>>> it can be fed arbitrary key-value constraints on Solr schema
>>>>>>>>> fields (and generate Mahout vectors for only a subset of the
>>>>>>>>> index, which seems to be a common use case).
>>>>>>>>>
>>>>>>>>> The best (easiest) way I see is to create an IndexReader
>>>>>>>>> implementation that would allow reading that subset.
>>>>>>>>>
>>>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>>>
>>>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but
>>>>>>>>> I don't know which methods to override to get a consistent
>>>>>>>>> object representation.
>>>>>>>>>
>>>>>>>>> The driver code includes the following:
>>>>>>>>>
>>>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>>>
>>>>>>>>> Weight weight;
>>>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>>>   weight = new TF();
>>>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>>>   weight = new TFIDF();
>>>>>>>>> } else {
>>>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>>>>>       + " is not supported");
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>>>>>     maxDFPercent);
>>>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>>>
>>>>>>>>> LuceneIterable iterable;
>>>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>>>> } else {
>>>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>>>       norm, maxPercentErrorDocs);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> It then creates a SequenceFile.Writer and writes out the
>>>>>>>>> "iterable" variable.
>>>>>>>>>
>>>>>>>>> Do you have any thoughts on how to inject the code in the
>>>>>>>>> simplest way?
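>>>>>>>>>
>>>>>>>>> To illustrate the kind of constraint I have in mind (the field
>>>>>>>>> names here are just examples), the subset would be described by
>>>>>>>>> a query along these lines:
>>>>>>>>>
>>>>>>>>> // combine key:value constraints as required clauses
>>>>>>>>> BooleanQuery subsetQuery = new BooleanQuery();
>>>>>>>>> subsetQuery.add(new TermQuery(new Term("lang", "en")),
>>>>>>>>>     BooleanClause.Occur.MUST);
>>>>>>>>> subsetQuery.add(new TermQuery(new Term("source", "wiki")),
>>>>>>>>>     BooleanClause.Occur.MUST);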
