Expose character offset information while parsing text-based inputs.
--------------------------------------------------------------------
Key: TIKA-272
URL: https://issues.apache.org/jira/browse/TIKA-272
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 0.4
Reporter: David Causse
Priority: Minor
It would be useful to have access to actual character offset information when
parsing text-based files (I don't know whether this is interesting, usable, or
doable for binary formats).
If I use Tika to parse HTML and feed the parsed strings into Lucene, I cannot
tell the Lucene analyzer where each character is located in the original
input.
If Tika exposed this information, it would allow unmodified Lucene analyzers
to be used behind Tika and make it possible to implement, for example, precise
highlighting in search results (see the Google cache view).
With the new Lucene Attribute API, it could be fairly easy to provide a sort
of TikaOffsetRectifierTokenFilter in Lucene contrib and use a stack like
Tika -> unmodified Lucene analyzer -> Tika offset correction.
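
To illustrate the kind of offset correction meant here, the following is a
minimal, dependency-free sketch. It is not Tika or Lucene code: the class
OffsetMappedText and its methods stripTags and toSourceSpan are hypothetical
names invented for this example, and the naive tag stripper only stands in for
a real HTML parser. The idea is that the parser records, for every extracted
character, its position in the original input, so that token offsets computed
on the extracted text can be translated back afterwards.

```java
import java.util.Arrays;

/** Hypothetical sketch, not part of Tika or Lucene. */
final class OffsetMappedText {
    /** Extracted text with the markup removed. */
    final String text;
    /** sourceOffsets[i] is the index in the original input of text.charAt(i). */
    final int[] sourceOffsets;

    OffsetMappedText(String text, int[] sourceOffsets) {
        this.text = text;
        this.sourceOffsets = sourceOffsets;
    }

    /**
     * Naive stand-in for a parser: drops everything between '<' and '>'
     * and remembers where each kept character came from.
     */
    static OffsetMappedText stripTags(String html) {
        StringBuilder out = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                offsets[out.length()] = i; // record the source position
                out.append(c);
            }
        }
        return new OffsetMappedText(out.toString(),
                Arrays.copyOf(offsets, out.length()));
    }

    /**
     * Translates a token span [start, end) in the extracted text into a
     * span [start, end) in the original input.
     */
    int[] toSourceSpan(int start, int end) {
        if (start < 0 || end > text.length() || start >= end) {
            throw new IllegalArgumentException("invalid span");
        }
        // end is exclusive, so map the last character and add one
        return new int[] { sourceOffsets[start], sourceOffsets[end - 1] + 1 };
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        OffsetMappedText mapped = stripTags(html);
        int start = mapped.text.indexOf("world");
        int[] span = mapped.toSourceSpan(start, start + "world".length());
        // The corrected span selects the same word in the original input.
        System.out.println(html.substring(span[0], span[1]));
    }
}
```

In the proposed stack, a token filter placed after the unmodified analyzer
could presumably apply the same translation to the start and end offsets
carried by each token (Lucene's OffsetAttribute), provided Tika exposed a
mapping of this kind.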