Jukka Zitting wrote:
Hi,

Quick summary of the Tika discussions from yesterday's text analysis
BOF at the ApacheCon EU. It's the next morning now, so I'm probably
missing a lot of stuff...

One other thing we discussed: for some input formats (such as HTML), it would make sense if Tika could produce output that can be mapped back to the input. In other words, it should be possible (optionally) to find out, for each character in the output text, where that character originated in the input. This is useful for result highlighting, for example.

This may not be something for the early releases, but it would be good if we could keep this option in the back of our heads when designing the interfaces.
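
To make the idea a bit more concrete, here is a rough sketch of what such a mapping could look like. It is not a proposal for the actual Tika interfaces; the names OffsetMappingSketch, Extracted and sourceOffsets are made up for illustration, and the "parser" only strips tags (no entities, comments or scripts).

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names are Tika API.
final class OffsetMappingSketch {

    // Extracted text plus, for each output character, its index in the input.
    static final class Extracted {
        final String text;
        final int[] sourceOffsets;

        Extracted(String text, int[] sourceOffsets) {
            this.text = text;
            this.sourceOffsets = sourceOffsets;
        }
    }

    // Naive tag stripper: drops everything between '<' and '>' and
    // records where each surviving character came from.
    static Extracted extract(String html) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                // The next output position is the current output length.
                offsets[text.length()] = i;
                text.append(c);
            }
        }
        // Trim the offset array to the number of characters actually emitted.
        return new Extracted(text.toString(), Arrays.copyOf(offsets, text.length()));
    }
}
```

With something like this, a highlighter that finds a hit at positions 3 to 5 of the extracted text can look up sourceOffsets[3] and sourceOffsets[5] to locate the corresponding range in the original document.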

--Thilo
