Hey Tim, sounds great to me.
—
Chris Mattmann
chris.mattm...@gmail.com
On 6/14/16, 8:53 AM, "Allison, Timothy B." wrote:
>Oh, wow. Y, that's probably more than we'd want to support (unless any other
>Tika devs have an interest?)...very, very cool!
>
>
>-----Original Message-----
>From: Justin Lee [mailto:lee.justi...@gmail.com]
>Sent: Monday, June 13, 2016 5:05 PM
>To: solr-u...@lucene.apache.org
>Subject: Re: Bypassing ExtractingRequestHandler
>
>Thanks everyone for the help and advice. The SolrJ example makes sense to me.
> The import of SOLR-8166 was kind of mind boggling to me, but maybe I'll
>revisit after some time.
>
>Tim: for context, I'm ultimately trying to create an external highlighter.
>See https://issues.apache.org/jira/browse/SOLR-1397. I want to store the
>bounding box (in PDF units) for each token in the extracted text stream.
>Then when I get results from Solr using the above patch, I'll convert the
>UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in
>the UI. I like this approach because I get highlighting that accurately
>reflects the search, even when the search is complex (e.g. wildcards or
>proximity searches).
>
>I think it would take quite a bit of thinking to get something general enough
>to add into Tika. For example, what units? Take a look at the discussion of
>what units to report offsets in here:
>https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert
>Muir, although it seems to me that whatever issues exist there apply equally
>to the offsets reported by the Term Vector Component). As
>another example, I'm just not sure what format is general enough to make sense
>for everybody. I think I'll just create a mapping from UTF-16 offsets into
>(x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL
>store. Then, when I get Solr results, I'll look at the matching offsets, the
>JSON blob, and the original document and be on my merry way. I'm happy to
>open a JIRA entry in Tika if you think this is a coherent request.
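
A rough sketch of the offset-to-box mapping described above, in Python purely for illustration. Everything here is an assumption rather than anything from Tika or Solr: the field names `start`, `end` and `box`, the helper names, and the premise that the extracted text is the tokens joined by a single space. Offsets are counted in UTF-16 code units so they line up with the offsets Solr reports.

```python
import json


def utf16_len(text):
    """Number of UTF-16 code units in text (astral characters count as 2)."""
    return len(text.encode("utf-16-le")) // 2


def build_offset_map(tokens):
    """Map each token to its UTF-16 offset range and bounding box.

    tokens: iterable of (text, (x1, y1, x2, y2)) in reading order.
    Assumes the extracted text is the tokens joined by one space.
    """
    entries = []
    pos = 0
    for text, box in tokens:
        end = pos + utf16_len(text)
        entries.append({"start": pos, "end": end, "box": list(box)})
        pos = end + 1  # skip the assumed single-space separator
    return entries


def boxes_for_range(entries, start, end):
    """Return the boxes of all tokens overlapping [start, end)."""
    return [e["box"] for e in entries if e["start"] < end and e["end"] > start]


# Hypothetical tokens with bounding boxes in PDF units.
tokens = [
    ("Hello", (72, 700, 100, 712)),
    ("\U0001D4B3", (104, 700, 110, 712)),  # one character, two UTF-16 units
    ("world", (114, 700, 140, 712)),
]
entries = build_offset_map(tokens)
blob = json.dumps(entries)  # the JSON blob to put in the external store
```

At query time the blob would be loaded back with `json.loads` and `boxes_for_range` called once per matching offset range. A linear scan is fine for a sketch; a sorted list with binary search would likely be preferable for long documents.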
>
>The other approach, I suppose, is to try to pass the information along during
>indexing and store as a token payload. But it seems like the indexing
>interface is really text oriented. I have also thought about using
>DelimitedPayloadTokenFilter, which I imagine will increase the index size (how
>much, though?) and require more customization of Solr internals. I don't know
>which is the better approach.
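
For the payload route, a small sketch (again Python, again only illustrative) of what the indexed text might look like. DelimitedPayloadTokenFilter splits each token at a delimiter character (`|` by default) and stores what follows as the payload. The comma-separated box encoding and the helper names below are assumptions, as is the use of whitespace tokenization with the identity encoder, which keeps the payload string as raw bytes.

```python
def to_delimited(tokens, delimiter="|"):
    """Render tokens as 'text|x1,y1,x2,y2' separated by spaces.

    tokens: iterable of (text, (x1, y1, x2, y2)).
    The box encoding is a made-up convention for this sketch.
    """
    parts = []
    for text, (x1, y1, x2, y2) in tokens:
        if delimiter in text or any(c.isspace() for c in text):
            raise ValueError("token would break the delimited format: %r" % text)
        parts.append(f"{text}{delimiter}{x1:g},{y1:g},{x2:g},{y2:g}")
    return " ".join(parts)


def payload_bytes(tokens):
    """Raw payload bytes added across all tokens, assuming identity encoding."""
    total = 0
    for _, (x1, y1, x2, y2) in tokens:
        total += len(f"{x1:g},{y1:g},{x2:g},{y2:g}".encode("utf-8"))
    return total


# Hypothetical tokens with bounding boxes in PDF units.
tokens = [("Hello", (72, 700, 100.5, 712)), ("world", (114, 700, 140, 712))]
line = to_delimited(tokens)
```

`payload_bytes` gives only a crude lower bound on the "how much" question: it counts the raw payload bytes per token and ignores how the index actually encodes and compresses postings. A packed binary encoding of the four numbers would be smaller than this string form, but would need a custom encoder.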
>
>On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. wrote:
>
>>
>> >Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff
>> >should be straightforward:
>> http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> +1
>>
>> > We tend to prefer running Tika externally as it's entirely possible
>> > that Tika will crash or hang with certain files - and that will
>> > bring down Solr if you're running Tika within it.
>>
>> +1
>>
>> >> I want to make a small modification to Tika to get and save
>> >> additional data from my PDFs
>> What info do you need, and if it is common enough, could you ask over
>> on Tika's JIRA and we'll try to add it directly?