[jira] Created: (TIKA-272) Expose characters offsets information while parsing text-based inputs.

David Causse (JIRA) Fri, 04 Sep 2009 01:25:25 -0700

Expose characters offsets information while parsing text-based inputs.
----------------------------------------------------------------------


                 Key: TIKA-272
                 URL: https://issues.apache.org/jira/browse/TIKA-272
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.4
            Reporter: David Causse
            Priority: Minor


It would be useful to have access to the actual character offsets when 
parsing text-based files (I don't know whether this is interesting, usable, 
or doable for binary formats).
If I use Tika to parse HTML and feed the parsed strings into Lucene, I cannot 
tell the Lucene analyzer where each character is located in the original 
input.
If Tika exposed this information, it would be possible to use unmodified 
Lucene analyzers behind Tika and to implement, for example, highlighting of 
matches inside the original document in search results (see the Google cache 
view).
With the new Lucene Attribute API, it could be fairly easy to provide a sort 
of TikaOffsetRectifierTokenFilter in Lucene contrib and use a stack like 
Tika -> unmodified Lucene analyzer -> Tika offset correction.
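
To make the idea concrete, here is a minimal sketch of the kind of offset 
map the request is about. It is only an illustration under stated 
assumptions: the class OffsetMapSketch, the method stripTags and the method 
toSourceSpan are hypothetical names, not part of Tika or Lucene, and the 
naive tag stripper merely stands in for a real parser. It records, for each 
extracted character, its position in the original input, so that a token 
span computed on the extracted text can be mapped back afterwards.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names exist in Tika or Lucene. */
class OffsetMapSketch {

    /** Extracted text plus, for each extracted char, its offset in the original input. */
    static final class Extracted {
        final String text;
        final int[] sourceOffsets;

        Extracted(String text, int[] sourceOffsets) {
            this.text = text;
            this.sourceOffsets = sourceOffsets;
        }

        /** Maps a [start, end) span of the extracted text to a [start, end) span of the input. */
        int[] toSourceSpan(int start, int end) {
            if (start < 0 || end > text.length() || start >= end) {
                throw new IllegalArgumentException("bad span: " + start + ".." + end);
            }
            // The end is exclusive, so map the last covered char and add one.
            return new int[] { sourceOffsets[start], sourceOffsets[end - 1] + 1 };
        }
    }

    /** Naive stand-in for a parser: drops everything between '<' and '>'. */
    static Extracted stripTags(String html) {
        StringBuilder out = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                offsets[out.length()] = i; // remember where this char came from
                out.append(c);
            }
        }
        return new Extracted(out.toString(), Arrays.copyOf(offsets, out.length()));
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        Extracted e = stripTags(html);
        // An analyzer working on e.text would report "world" at [6, 11).
        int[] span = e.toSourceSpan(6, 11);
        System.out.println(html.substring(span[0], span[1]));
    }
}
```

A token filter placed after an unmodified analyzer could, in principle, 
apply such a mapping to the start and end offsets carried by Lucene's 
OffsetAttribute. How Tika would actually expose the map is left open here. 
Note that a token spanning inline markup maps to a source span that includes 
the markup in between.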

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
