LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();

Configuration configuration = ... ;
IndexDirectory indexDirectory = ... ;
Path seqPath = ... ;
String idField = ... ;
String field = ... ;
List<String> extraFields = asList( ... );
Query query = ... ;

LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
    new LuceneIndexToSequenceFilesConfiguration(configuration,
        indexDirectory.getFile(), seqPath, idField, field);
lucene2SeqConf.setExtraFields(extraFields);
lucene2SeqConf.setQuery(query);

lucene2Seq.run(lucene2SeqConf);
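If you want to fetch all documents rather than a query subset, you can use a
MatchAllDocsQuery (from org.apache.lucene.search), for example:

Query query = new MatchAllDocsQuery(); // matches every document in the index
lucene2SeqConf.setQuery(query);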
The seqPath variable can then be passed into seq2sparse (e.g. as its -i/--input option).

Cheers,

Frank

On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin <[email protected]> wrote:
> Frank, could you please tell me how to use your lucene2seq tool?
>
> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>
>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>> <[email protected]> wrote:
>>> Thank you, Frank! I'll definitely have a look at it.
>>>
>>> As far as I can see, the problem with using Lucene for clustering
>>> tasks is that even with queries you only get access to the
>>> "tip of the iceberg" of the results, while clustering tasks need to
>>> deal with the results as a whole.
>>>
>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>> Hi Michael,
>>>>
>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>
>>>> This is a lucene2seq tool. You can pass in fields and a Lucene query
>>>> and it generates text sequence files.
>>>>
>>>> From there you can use seq2sparse.
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>> Sorry for brevity, sent from phone
>>>>
>>>> On Jan 17, 2012, at 17:37, Michael Kazekin
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I am trying to extend the "mahout lucene.vector" driver so that it
>>>>> can be fed arbitrary key-value constraints on Solr schema fields
>>>>> (and generate Mahout vectors for only a subset of the index, which
>>>>> seems to be a common use case).
>>>>>
>>>>> The best (easiest) way I see is to create an IndexReader
>>>>> implementation that exposes only that subset.
>>>>>
>>>>> The problem is that I don't know the correct way to do this.
>>>>>
>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I
>>>>> don't know which methods to override to get a consistent object
>>>>> representation.
>>>>>
>>>>> The driver code includes the following:
>>>>>
>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>
>>>>> Weight weight;
>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>   weight = new TF();
>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>   weight = new TFIDF();
>>>>> } else {
>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>       + " is not supported");
>>>>> }
>>>>>
>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>     maxDFPercent);
>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>
>>>>> LuceneIterable iterable;
>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>> } else {
>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>       norm, maxPercentErrorDocs);
>>>>> }
>>>>>
>>>>> It then creates a SequenceFile.Writer and writes out the iterable.
>>>>>
>>>>> Do you have any thoughts on how to inject the code in the simplest
>>>>> way?
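>>>>>
>>>>> For concreteness, here is the kind of (completely untested) sketch
>>>>> I have in mind, assuming the subset has already been materialized
>>>>> as an OpenBitSet of the document ids to keep -- I am not sure these
>>>>> overrides are enough for a fully consistent view:
>>>>>
>>>>> import java.io.IOException;
>>>>>
>>>>> import org.apache.lucene.index.FilterIndexReader;
>>>>> import org.apache.lucene.index.IndexReader;
>>>>> import org.apache.lucene.index.Term;
>>>>> import org.apache.lucene.index.TermDocs;
>>>>> import org.apache.lucene.util.OpenBitSet;
>>>>>
>>>>> /** Exposes only the documents whose ids are set in 'keep'. */
>>>>> public class SubsetIndexReader extends FilterIndexReader {
>>>>>
>>>>>   private final OpenBitSet keep;
>>>>>
>>>>>   public SubsetIndexReader(IndexReader in, OpenBitSet keep) {
>>>>>     super(in);
>>>>>     this.keep = keep;
>>>>>   }
>>>>>
>>>>>   // Report everything outside the subset as deleted.
>>>>>   @Override
>>>>>   public boolean isDeleted(int n) {
>>>>>     return !keep.get(n) || in.isDeleted(n);
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public boolean hasDeletions() {
>>>>>     return true;
>>>>>   }
>>>>>
>>>>>   // Assumes 'keep' only marks documents that are live in 'in'.
>>>>>   @Override
>>>>>   public int numDocs() {
>>>>>     return (int) keep.cardinality();
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public TermDocs termDocs() throws IOException {
>>>>>     return filtered(in.termDocs());
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public TermDocs termDocs(Term term) throws IOException {
>>>>>     return filtered(in.termDocs(term));
>>>>>   }
>>>>>
>>>>>   // Skips postings of documents outside the subset. read(int[],
>>>>>   // int[]), skipTo(int) and termPositions() would presumably need
>>>>>   // the same treatment.
>>>>>   private TermDocs filtered(TermDocs termDocs) {
>>>>>     return new FilterTermDocs(termDocs) {
>>>>>       @Override
>>>>>       public boolean next() throws IOException {
>>>>>         while (in.next()) {
>>>>>           if (keep.get(in.doc())) {
>>>>>             return true;
>>>>>           }
>>>>>         }
>>>>>         return false;
>>>>>       }
>>>>>     };
>>>>>   }
>>>>> }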
