Are you using Lucene 3.4? I had this problem as well and I believe this was because of https://issues.apache.org/jira/browse/LUCENE-3442 which is fixed in Lucene 3.5.
On Wed, Jan 25, 2012 at 1:42 PM, Michael Kazekin <[email protected]> wrote:
> Frank, I tried to use a BooleanQuery consisting of several TermQueries
> (these represent key:value constraints, where key is the field name, for
> example "lang:en"), but the Scorer created by the Weight in your code is
> null. Do you know what could be wrong here?
>
> Sorry to bother you on the dev list with such questions, but I am trying
> to make a CLI util for this code, so I think it would be helpful for
> everybody.

Great! Let me know if you need more help.

Cheers,

Frank

>
> On 01/20/2012 02:15 AM, Frank Scholten wrote:
>>
>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>
>> Configuration configuration = ... ;
>> IndexDirectory indexDirectory = ... ;
>> Path seqPath = ... ;
>> String idField = ... ;
>> String field = ... ;
>> List<String> extraFields = asList( ... );
>> Query query = ... ;
>>
>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf = new
>> LuceneIndexToSequenceFilesConfiguration(configuration,
>> indexDirectory.getFile(), seqPath, idField, field);
>> lucene2SeqConf.setExtraFields(extraFields);
>> lucene2SeqConf.setQuery(query);
>>
>> lucene2Seq.run(lucene2SeqConf);
>>
>> The seqPath variable can be passed into seq2sparse.
>>
>> Cheers,
>>
>> Frank
>>
>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
>> <[email protected]> wrote:
>>>
>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>
>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>>
>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>
>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>>
>>>>> As far as I can see, the problem with using Lucene for clustering
>>>>> tasks is that even with queries you get access only to the
>>>>> "tip-of-the-iceberg" results, while clustering tasks need to deal
>>>>> with the results as a whole.
>>>>>
>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>
>>>>>> This is a lucene2seq tool. You can pass in fields and a Lucene
>>>>>> query and it generates text sequence files.
>>>>>>
>>>>>> From there you can use seq2sparse.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Frank
>>>>>>
>>>>>> Sorry for brevity, sent from phone
>>>>>>
>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that it
>>>>>>> can be fed arbitrary key:value constraints on Solr schema fields
>>>>>>> (and generate Mahout vectors for only a subset, which seems to be
>>>>>>> a regular use case).
>>>>>>>
>>>>>>> The best (easiest) way I see is to create an IndexReader
>>>>>>> implementation that would allow reading the subset.
>>>>>>>
>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>
>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>>>> don't know which methods to override to get a consistent object
>>>>>>> representation.
>>>>>>>
>>>>>>> The driver code includes the following:
>>>>>>>
>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>
>>>>>>> Weight weight;
>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TF();
>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>   weight = new TFIDF();
>>>>>>> } else {
>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>>>       + " is not supported");
>>>>>>> }
>>>>>>>
>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>>>     maxDFPercent);
>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>
>>>>>>> LuceneIterable iterable;
>>>>>>>
>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>> } else {
>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>       norm, maxPercentErrorDocs);
>>>>>>> }
>>>>>>>
>>>>>>> It then creates a SequenceFile.Writer and writes out the
>>>>>>> "iterable" variable.
>>>>>>>
>>>>>>> Do you have any thoughts on how to inject the code in the simplest
>>>>>>> way?
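For reference, the query Michael describes upthread (a BooleanQuery of TermQuery clauses, one per key:value constraint such as "lang:en") could be built along these lines. This is a minimal sketch against the Lucene 3.x API this thread assumes; the second field/value pair is a made-up example, not something from the thread:

```java
// Sketch: combining key:value constraints into a BooleanQuery
// (Lucene 3.x API assumed). The resulting query could be passed to
// lucene2SeqConf.setQuery(...). "type":"article" is a hypothetical
// example constraint.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ConstraintQuerySketch {
    public static void main(String[] args) {
        BooleanQuery query = new BooleanQuery();
        // Each key:value constraint becomes a required (MUST) clause.
        query.add(new TermQuery(new Term("lang", "en")), Occur.MUST);
        query.add(new TermQuery(new Term("type", "article")), Occur.MUST);
        System.out.println(query); // e.g. +lang:en +type:article
    }
}
```

One caveat worth checking against the null-Scorer symptom: TermQuery matches indexed terms exactly, so if the field was analyzed at index time the constraint value may need to be the analyzed form of the term, and a Scorer of null from a Weight typically just means no documents match that clause.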
