This sounds like a Lucene query. There are a lot of Lucene coding
resources, including two editions of the book Lucene in Action.

On Thu, Jan 19, 2012 at 2:15 PM, Frank Scholten <[email protected]> wrote:
> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>
> Configuration configuration = ... ;
> IndexDirectory indexDirectory = ... ;
> Path seqPath = ... ;
> String idField = ... ;
> String field = ... ;
> List<String> extraFields = asList( ... );
> Query query = ... ;
>
> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
>     new LuceneIndexToSequenceFilesConfiguration(configuration,
>         indexDirectory.getFile(), seqPath, idField, field);
> lucene2SeqConf.setExtraFields(extraFields);
> lucene2SeqConf.setQuery(query);
>
> lucene2Seq.run(lucene2SeqConf);
>
> The seqPath variable can be passed into seq2sparse.
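
That step could look roughly like the sketch below, assuming Mahout's
SparseVectorsFromSequenceFiles driver (org.apache.mahout.vectorizer) and
made-up paths; running "bin/mahout seq2sparse -i <seqPath> -o <outDir>" on
the command line should be equivalent.

import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class VectorizeLucene2SeqOutput {
  public static void main(String[] args) throws Exception {
    String seqDir = "/tmp/lucene2seq-output";     // where lucene2Seq.run(...) wrote its sequence files
    String vectorDir = "/tmp/seq2sparse-output";  // where the sparse vectors should go
    // Delegates to the same driver that backs the "mahout seq2sparse" command.
    SparseVectorsFromSequenceFiles.main(new String[] {
        "-i", seqDir,
        "-o", vectorDir
    });
  }
}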
>
> Cheers,
>
> Frank
>
> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
> <[email protected]> wrote:
>> Frank, could you please tell me how to use your lucene2seq tool?
>>
>>
>>
>>
>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>
>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
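
In the snippet above that just means something like the following (plain
Lucene query classes, nothing tool-specific):

import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

// Matches every document, so lucene2seq exports the whole index
// rather than a query-defined subset.
Query query = new MatchAllDocsQuery();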
>>>
>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>>> <[email protected]>  wrote:
>>>>
>>>> Thank you, Frank! I'll definitely have a look at it.
>>>>
>>>> As far as I can see, the problem with using Lucene in clustering tasks
>>>> is that even with queries you get access to the "tip-of-the-iceberg"
>>>> results only, while clustering tasks need to deal with the results as a
>>>> whole.
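
For what it is worth, complete result sets can be pulled out of Lucene by
collecting every matching doc id with a Collector instead of a top-N search,
which is roughly what an export tool has to do before clustering. A rough
sketch against the Lucene 3.x Collector API (the class name is made up):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public final class AllDocIdsCollector extends Collector {

  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    // Scores are not needed for a plain export.
  }

  @Override
  public void collect(int doc) {
    docIds.add(docBase + doc);   // remember every hit, not just the top N
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;      // collect() receives segment-relative doc ids
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public List<Integer> getDocIds() {
    return docIds;
  }
}

Passed to IndexSearcher.search(query, collector), it ends up holding every
matching doc id.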
>>>>
>>>>
>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Checkout https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>
>>>>> This is the lucene2seq tool. You can pass in fields and a Lucene query
>>>>> and it generates text sequence files.
>>>>>
>>>>> From there you can use seq2sparse.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Frank
>>>>>
>>>>> Sorry for brevity, sent from phone
>>>>>
>>>>> On Jan 17, 2012, at 17:37, Michael
>>>>> Kazekin<[email protected]>    wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I am trying to extend the "mahout lucene.vector" driver so that it can be
>>>>>> fed arbitrary key-value constraints on Solr schema fields (and generate
>>>>>> only a subset of the Mahout vectors, which seems to be a common use case).
>>>>>>
>>>>>> So the best (easiest) way I see is to create an IndexReader implementation
>>>>>> that would allow reading only that subset.
>>>>>>
>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>
>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but I don't
>>>>>> know which methods to override to get a consistent object representation.
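
One possible direction, sketched very roughly against the Lucene 3.x
FilterIndexReader API (the class name and the "accepted" bit set are made up,
and overriding only these methods is probably not enough for every
IndexReader client): report everything outside a precomputed doc id set as
deleted, so callers that skip deleted documents never see it.

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.OpenBitSet;

public class SubsetIndexReader extends FilterIndexReader {

  private final OpenBitSet accepted;   // doc ids that matched the constraint query

  public SubsetIndexReader(IndexReader in, OpenBitSet accepted) {
    super(in);
    this.accepted = accepted;
  }

  @Override
  public boolean isDeleted(int docId) {
    // Hide non-matching documents in addition to real deletions.
    return !accepted.get(docId) || in.isDeleted(docId);
  }

  @Override
  public boolean hasDeletions() {
    return true;   // from the caller's point of view some docs are always "deleted"
  }

  @Override
  public int numDocs() {
    // Assumes "accepted" was built against live (non-deleted) documents only.
    return (int) accepted.cardinality();
  }
}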
>>>>>>
>>>>>>
>>>>>>
>>>>>> The driver code includes the following:
>>>>>>
>>>>>>
>>>>>>
>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>
>>>>>> Weight weight;
>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>   weight = new TF();
>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>   weight = new TFIDF();
>>>>>> } else {
>>>>>>   throw new IllegalArgumentException("Weight type " + weightType + " is not supported");
>>>>>> }
>>>>>>
>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf, maxDFPercent);
>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>
>>>>>> LuceneIterable iterable;
>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>> } else {
>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper, norm,
>>>>>>       maxPercentErrorDocs);
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> It then creates a SequenceFile.Writer and writes out the "iterable" variable.
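
That part could look roughly like the fragment below, continuing the snippet
above (Hadoop's SequenceFile API; the output path and the running-counter key
are assumptions, not the driver's exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outPath = new Path("/tmp/lucene-vectors/part-00000");   // assumed output location

SequenceFile.Writer writer =
    new SequenceFile.Writer(fs, conf, outPath, Text.class, VectorWritable.class);
try {
  VectorWritable value = new VectorWritable();
  int key = 0;
  for (Vector vector : iterable) {      // the LuceneIterable built above
    value.set(vector);
    writer.append(new Text(String.valueOf(key++)), value);
  }
} finally {
  writer.close();
}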
>>>>>>
>>>>>>
>>>>>> Do you have any thoughts on how to inject the code in the simplest way?
>>>>>>
>>>
>>



-- 
Lance Norskog
[email protected]
