[jira] Commented: (LUCENE-868) Making Term Vectors more accessible

Karl Wettin (JIRA) Thu, 19 Jul 2007 12:45:28 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513983 ]


Karl Wettin commented on LUCENE-868:
------------------------------------

Sorry for the delay, vacation time.

In short I think this is a really nice improvement to the API. I also agree with 
Yonik about the array[]s constructed and passed down to the mapper. Perhaps 
your current implementation could be moved one layer further up? Another 
thought is to reuse array(s) and pass on the data length, but that might just 
complicate things.
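
A rough sketch of the array-reuse idea, just to make it concrete. Everything 
here is hypothetical and not taken from the patch: BulkMapper, ReusingLoader 
and TotalFrequency are made-up names, and the input arrays stand in for 
whatever the term vector reader would decode from the index. The loader keeps 
its buffers between documents and hands the mapper the number of valid slots.

```java
public class ReusedArraySketch {

    // Hypothetical bulk callback: only the first `length` slots are valid,
    // and the mapper must not keep a reference to the arrays.
    interface BulkMapper {
        void map(String[] terms, int[] freqs, int length);
    }

    // Hypothetical loader that reuses its buffers across documents.
    static class ReusingLoader {
        private String[] terms = new String[4];
        private int[] freqs = new int[4];
        int allocations = 1; // counts how often buffers were (re)allocated

        void load(String[] docTerms, int[] docFreqs, BulkMapper mapper) {
            int n = docTerms.length;
            if (n > terms.length) {
                // Grow only when a document does not fit.
                int newSize = Math.max(n, terms.length * 2);
                terms = new String[newSize];
                freqs = new int[newSize];
                allocations++;
            }
            // Stand-in for decoding the term vector from the index.
            System.arraycopy(docTerms, 0, terms, 0, n);
            System.arraycopy(docFreqs, 0, freqs, 0, n);
            mapper.map(terms, freqs, n);
        }
    }

    // Example mapper: sums the frequencies it is shown.
    static class TotalFrequency implements BulkMapper {
        long total;

        @Override
        public void map(String[] terms, int[] freqs, int length) {
            for (int i = 0; i < length; i++) {
                total += freqs[i];
            }
        }
    }

    public static void main(String[] args) {
        ReusingLoader loader = new ReusingLoader();
        TotalFrequency sum = new TotalFrequency();
        loader.load(new String[] {"a", "b", "c"}, new int[] {1, 2, 3}, sum);
        // The second document is shorter, so slot 2 still holds stale data.
        loader.load(new String[] {"d", "e"}, new int[] {4, 5}, sum);
        System.out.println(sum.total + " " + loader.allocations); // prints 15 1
    }
}
```

The stale third slot after the second document is the complication: a mapper 
that ignores the length would count it again.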

I'll try to introduce these things next week and see how well it works. 

I use the term vectors for text classification. For each new classifier 
introduced (occurs quite a lot) I iterate the corpus and classify the 
documents. Potentially it could save me quite a bit of ticks and bits to not 
create all those array[]s, however my gut tells me there might be some JVM 
settings that do the same trick. I'll have to look into that.



> Making Term Vectors more accessible
> -----------------------------------
>
>                 Key: LUCENE-868
>                 URL: https://issues.apache.org/jira/browse/LUCENE-868
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-868-v2.patch, LUCENE-868-v3.patch
>
>
> One of the big issues with term vector usage is that the information is 
> loaded into parallel arrays as it is loaded, which are then often times 
> manipulated again to use in the application (for instance, they are sorted by 
> frequency).
> Adding a callback mechanism that allows the vector loading to be handled by 
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, 
> TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The 
> existing getTermFreqVectors will be reimplemented to use an implementation of 
> TermVectorMapper that creates the parallel arrays.  Additionally, some simple 
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have 
> a patch soon.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
>  for related information.
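
For illustration, the callback described above might look roughly like the 
sketch below. Only the map(term, frequency, offset, position) signature comes 
from the issue text. FrequencySortedMapper and the feed() helper are 
hypothetical stand-ins, and feed() simply replays in-memory data where the 
real reader would call the mapper while loading a vector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TermVectorMapperSketch {

    // The single-method callback as worded in the issue description.
    interface TermVectorMapper {
        void map(String term, int frequency, int offset, int position);
    }

    // Hypothetical mapper that collects terms and can return them by
    // descending frequency, so no parallel arrays need re-sorting later.
    static class FrequencySortedMapper implements TermVectorMapper {
        static final class Entry {
            final String term;
            final int frequency;

            Entry(String term, int frequency) {
                this.term = term;
                this.frequency = frequency;
            }
        }

        private final List<Entry> entries = new ArrayList<>();

        @Override
        public void map(String term, int frequency, int offset, int position) {
            // Offsets and positions are ignored in this sketch.
            entries.add(new Entry(term, frequency));
        }

        List<String> termsByFrequency() {
            List<Entry> sorted = new ArrayList<>(entries);
            sorted.sort(Comparator.comparingInt((Entry e) -> e.frequency).reversed());
            List<String> terms = new ArrayList<>();
            for (Entry e : sorted) {
                terms.add(e.term);
            }
            return terms;
        }
    }

    // Assumption: stands in for the reader, which would invoke the mapper
    // once per term. Offset and position are passed as -1 (not stored).
    static void feed(String[] terms, int[] freqs, TermVectorMapper mapper) {
        for (int i = 0; i < terms.length; i++) {
            mapper.map(terms[i], freqs[i], -1, -1);
        }
    }

    public static void main(String[] args) {
        FrequencySortedMapper mapper = new FrequencySortedMapper();
        feed(new String[] {"lucene", "term", "vector"}, new int[] {2, 5, 3}, mapper);
        System.out.println(mapper.termsByFrequency()); // prints [term, vector, lucene]
    }
}
```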

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
