Hi Michael,

Can you compare what you are doing with the code from the testRun_query() unit test
in LuceneIndexToSequenceFilesTest? The unit test works, so I am curious where the
difference is.
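One thing worth double checking: in Lucene 3.x, Weight.scorer() is a per-segment
call (IndexSearcher invokes it once per sub-reader), and returning null is a
Weight's legitimate way of saying "no documents match in this reader", so callers
have to handle it. I am not certain that is the cause in your case, but if all you
need is the set of matching documents, a sketch along these lines uses the
supported search API instead of building the Scorer by hand (assuming Lucene 3.5,
and reusing the searcher and atomQuery from your snippet):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

final List<Integer> hits = new ArrayList<Integer>();
searcher.search(atomQuery, new Collector() {
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    // Scores are not needed; we only collect matching doc ids.
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase; // the search runs segment by segment
  }

  @Override
  public void collect(int doc) throws IOException {
    hits.add(docBase + doc); // rebase the per-segment id to the top-level reader
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // order does not matter when only gathering ids
  }
});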
Cheers,

Frank

On Fri, Jan 27, 2012 at 4:41 PM, Michael Kazekin <[email protected]> wrote:
> Frank, I tried this code with a Solr 3.5 index (and changed all the
> dependencies in the pom file), but it still doesn't work:
>
> Directory directory = FSDirectory.open(file);
> IndexReader reader = IndexReader.open(directory, true);
> IndexSearcher searcher = new IndexSearcher(reader);
>
> I try to get a Scorer with this TermQuery (the "lang" field is indexed
> and stored, and all the data is available):
>
> TermQuery atomQuery = new TermQuery(new Term("lang", "ru"));
>
> Weight weight = atomQuery.createWeight(searcher);
> Scorer scorer = weight.scorer(reader, true, false);
>
> // scorer == null here
>
> On 01/25/2012 07:04 PM, Frank Scholten wrote:
>>
>> Are you using Lucene 3.4? I had this problem as well, and I believe it
>> was because of https://issues.apache.org/jira/browse/LUCENE-3442,
>> which is fixed in Lucene 3.5.
>>
>> On Wed, Jan 25, 2012 at 1:42 PM, Michael Kazekin
>> <[email protected]> wrote:
>>>
>>> Frank, I tried to use a BooleanQuery composed of several TermQueries
>>> (these represent key:value constraints, where the key is the field
>>> name, for example "lang:en"), but the Scorer created by the Weight in
>>> your code is null. Do you know what could be wrong here?
>>>
>>> Sorry to bother you on the dev list with such questions, but I am
>>> trying to build a CLI util for this code, so I think it would be
>>> helpful for everybody.
>>
>> Great! Let me know if you need more help.
>>
>> Cheers,
>>
>> Frank
>>
>>> On 01/20/2012 02:15 AM, Frank Scholten wrote:
>>>>
>>>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>>>
>>>> Configuration configuration = ... ;
>>>> IndexDirectory indexDirectory = ... ;
>>>> Path seqPath = ... ;
>>>> String idField = ... ;
>>>> String field = ... ;
>>>> List<String> extraFields = asList( ... );
>>>> Query query = ... ;
>>>>
>>>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf = new
>>>>     LuceneIndexToSequenceFilesConfiguration(configuration,
>>>>         indexDirectory.getFile(), seqPath, idField, field);
>>>> lucene2SeqConf.setExtraFields(extraFields);
>>>> lucene2SeqConf.setQuery(query);
>>>>
>>>> lucene2Seq.run(lucene2SeqConf);
>>>>
>>>> The seqPath variable can be passed into seq2sparse.
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>>>
>>>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>>>>
>>>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>>>
>>>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>>>>
>>>>>>> As far as I can see, the problem with using Lucene for clustering
>>>>>>> tasks is that even with queries you only get access to the
>>>>>>> "tip of the iceberg" of the results, while clustering tasks need
>>>>>>> to deal with the results as a whole.
>>>>>>>
>>>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>>>
>>>>>>>> This is a lucene2seq tool. You can pass in fields and a Lucene
>>>>>>>> query, and it generates text sequence files.
>>>>>>>>
>>>>>>>> From there you can use seq2sparse.
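>>>>>>>>
>>>>>>>> For example, something along these lines (the paths are
>>>>>>>> placeholders, and the exact flags depend on your Mahout version):
>>>>>>>>
>>>>>>>> $ bin/mahout seq2sparse -i /path/to/sequence-files -o /path/to/vectors -wt tfidf -ow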
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> Sorry for brevity, sent from phone
>>>>>>>>
>>>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that
>>>>>>>>> it can be fed arbitrary key-value constraints on Solr schema
>>>>>>>>> fields (and generate Mahout vectors for only a subset of the
>>>>>>>>> index, which seems to be a common use case).
>>>>>>>>>
>>>>>>>>> The best (easiest) way I see is to create an IndexReader
>>>>>>>>> implementation that would allow reading that subset.
>>>>>>>>>
>>>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>>>
>>>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but
>>>>>>>>> I don't know which methods to override to get a consistent
>>>>>>>>> object representation.
>>>>>>>>>
>>>>>>>>> The driver code includes the following:
>>>>>>>>>
>>>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>>>
>>>>>>>>> Weight weight;
>>>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>>>   weight = new TF();
>>>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>>>   weight = new TFIDF();
>>>>>>>>> } else {
>>>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>>>>>       + " is not supported");
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>>>>>     maxDFPercent);
>>>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>>>
>>>>>>>>> LuceneIterable iterable;
>>>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>>>> } else {
>>>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>>>       norm, maxPercentErrorDocs);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> It then creates a SequenceFile.Writer and writes out the
>>>>>>>>> "iterable" variable.
>>>>>>>>>
>>>>>>>>> Do you have any thoughts on how to inject the code in the
>>>>>>>>> simplest way?
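>>>>>>>>>
>>>>>>>>> To illustrate the kind of constraint I have in mind (the field
>>>>>>>>> names here are just examples), the subset would be described by
>>>>>>>>> a query along these lines:
>>>>>>>>>
>>>>>>>>> // combine key:value constraints as required clauses
>>>>>>>>> BooleanQuery subsetQuery = new BooleanQuery();
>>>>>>>>> subsetQuery.add(new TermQuery(new Term("lang", "en")),
>>>>>>>>>     BooleanClause.Occur.MUST);
>>>>>>>>> subsetQuery.add(new TermQuery(new Term("source", "wiki")),
>>>>>>>>>     BooleanClause.Occur.MUST);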
