Re: Bypassing ExtractingRequestHandler

2016-06-14 Thread Chris Mattmann
Hey Tim, sounds great to me.

—
Chris Mattmann
chris.mattm...@gmail.com

On 6/14/16, 8:53 AM, "Allison, Timothy B."  wrote:

>Oh, wow.  Y, that's probably more than we'd want to support (unless any other 
>Tika devs have an interest?)...very, very cool!
>
>
>-----Original Message-----
>From: Justin Lee [mailto:lee.justi...@gmail.com] 
>Sent: Monday, June 13, 2016 5:05 PM
>To: solr-u...@lucene.apache.org
>Subject: Re: Bypassing ExtractingRequestHandler
>
>Thanks everyone for the help and advice.  The SolrJ example makes sense to me. 
> The import of SOLR-8166 was kind of mind boggling to me, but maybe I'll 
>revisit after some time.
>
>Tim: for context, I'm ultimately trying to create an external highlighter.
>See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the 
>bounding box (in PDF units) for each token in the extracted text stream.
>Then when I get results from Solr using the above patch, I'll convert the
>UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in 
>the UI.  I like this approach because I get highlighting that accurately 
>reflects the search, even when the search is complex (e.g. wildcards or 
>proximity searches).
>
>I think it would take quite a bit of thinking to get something general enough 
>to add into Tika.  For example, what units?  Take a look at the discussion of 
>what units to report offsets in here:
>https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert 
>Muir -- although whatever issues there are here they are the same as the 
>offsets reported in the Term Vector Component, it would seem to me).  As 
>another example, I'm just not sure what format is general enough to make sense 
>for everybody.  I think I'll just create a mapping from UTF-16 offsets into 
>(x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL 
>store.  Then, when I get Solr results, I'll look at the matching offsets, the 
>JSON blob, and the original document and be on my merry way.  I'm happy to 
>open a JIRA entry in Tika if you think this is a coherent request.
>
>The other approach, I suppose, is to try to pass the information along during 
>indexing and store as a token payload.  But it seems like the indexing 
>interface is really text oriented.  I have also thought about using 
>DelimitedPayloadTokenFilter, which will increase the index size I imagine (how 
>much, though?) and require more customization of Solr internals.  I don't know 
>which is the better approach.
>
>On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. 
>wrote:
>
>>
>>
>>
>> > Two things: Here's a sample bit of SolrJ code, pulling out the DB
>> > stuff should be straightforward:
>> > http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> +1
>>
>> > We tend to prefer running Tika externally as it's entirely possible 
>> > that Tika will crash or hang with certain files - and that will 
>> > bring down Solr if you're running Tika within it.
>>
>> +1
>>
>> >> I want to make a small modification to Tika to get and save 
>> >> additional data from my PDFs
>> What info do you need, and if it is common enough, could you ask over 
>> on Tika's JIRA and we'll try to add it directly?
>>
>>
>>
>>



RE: Bypassing ExtractingRequestHandler

2016-06-14 Thread Allison, Timothy B.
Oh, wow.  Y, that's probably more than we'd want to support (unless any other 
Tika devs have an interest?)...very, very cool!


-----Original Message-----
From: Justin Lee [mailto:lee.justi...@gmail.com] 
Sent: Monday, June 13, 2016 5:05 PM
To: solr-u...@lucene.apache.org
Subject: Re: Bypassing ExtractingRequestHandler

Thanks everyone for the help and advice.  The SolrJ example makes sense to me.  
The import of SOLR-8166 was kind of mind boggling to me, but maybe I'll revisit 
after some time.

Tim: for context, I'm ultimately trying to create an external highlighter.
See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the 
bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in 
the UI.  I like this approach because I get highlighting that accurately 
reflects the search, even when the search is complex (e.g. wildcards or 
proximity searches).

I think it would take quite a bit of thinking to get something general enough 
to add into Tika.  For example, what units?  Take a look at the discussion of 
what units to report offsets in here:
https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert 
Muir -- although whatever issues there are here they are the same as the 
offsets reported in the Term Vector Component, it would seem to me).  As 
another example, I'm just not sure what format is general enough to make sense 
for everybody.  I think I'll just create a mapping from UTF-16 offsets into 
(x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL 
store.  Then, when I get Solr results, I'll look at the matching offsets, the 
JSON blob, and the original document and be on my merry way.  I'm happy to open 
a JIRA entry in Tika if you think this is a coherent request.
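
A minimal sketch of the offset-to-box mapping described above, not an actual
implementation. It assumes tokens are joined by single spaces in the extracted
text, that boxes are (x1, y1, x2, y2) in PDF units, and that offsets count
UTF-16 code units (so a character outside the BMP takes two). The token data
and the function names are hypothetical, and the NoSQL store is left out; only
the JSON blob is shown.

```python
import json


def utf16_len(s):
    """Length of s in UTF-16 code units, which is what Java offsets count."""
    return len(s.encode("utf-16-le")) // 2


def build_offset_map(tokens):
    """Map each token to its [start, end) UTF-16 offsets and bounding box.

    tokens: list of (text, (x1, y1, x2, y2)) in reading order, assumed to be
    separated by exactly one space in the extracted text stream.
    """
    entries, pos = [], 0
    for text, box in tokens:
        end = pos + utf16_len(text)
        entries.append({"start": pos, "end": end, "box": list(box)})
        pos = end + 1  # skip the single-space separator
    return entries


def boxes_for_match(entries, start, end):
    """Return the box of every token overlapping the [start, end) range."""
    return [e["box"] for e in entries if e["start"] < end and e["end"] > start]


# Hypothetical tokens and PDF-unit boxes, for illustration only.
tokens = [
    ("Apache", (72, 700, 110, 712)),
    ("Tika", (114, 700, 138, 712)),
    ("\U0001F600", (142, 700, 154, 712)),  # non-BMP: two UTF-16 code units
    ("parser", (158, 700, 194, 712)),
]

blob = json.dumps(build_offset_map(tokens))  # this is what would be stored
loaded = json.loads(blob)

# Offsets 7..11 are what a highlighter would report for a match on "Tika".
tika_boxes = boxes_for_match(loaded, 7, 11)
```

Keeping the map outside the index means the index itself is unchanged; the
cost is a second lookup per result and keeping the blob in sync with the
indexed text.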

The other approach, I suppose, is to try to pass the information along during 
indexing and store as a token payload.  But it seems like the indexing 
interface is really text oriented.  I have also thought about using 
DelimitedPayloadTokenFilter, which will increase the index size I imagine (how 
much, though?) and require more customization of Solr internals.  I don't know 
which is the better approach.
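
For comparison, a rough sketch of what the payload route would feed to the
analyzer, assuming the field is tokenized on whitespace and
DelimitedPayloadTokenFilter uses its default '|' delimiter. The "x1,y1,x2,y2"
payload encoding and the helper names are made up for illustration. The byte
count is only the raw payload size, assuming each payload is stored as its
UTF-8 bytes; it says nothing about the final on-disk index size, which also
depends on how Lucene encodes and compresses postings.

```python
def to_delimited_payload_text(tokens, delimiter="|"):
    """Build 'token|x1,y1,x2,y2' text for a whitespace-tokenized field."""
    parts = []
    for text, (x1, y1, x2, y2) in tokens:
        if delimiter in text or any(c.isspace() for c in text):
            raise ValueError(f"token {text!r} would break the delimited format")
        parts.append(f"{text}{delimiter}{x1},{y1},{x2},{y2}")
    return " ".join(parts)


def raw_payload_bytes(field_text, delimiter="|"):
    """Sum the UTF-8 byte length of every payload in the delimited text."""
    return sum(
        len(tok.split(delimiter, 1)[1].encode("utf-8"))
        for tok in field_text.split()
    )


# Hypothetical tokens and PDF-unit boxes, for illustration only.
tokens = [
    ("Apache", (72, 700, 110, 712)),
    ("Tika", (114, 700, 138, 712)),
]

field_text = to_delimited_payload_text(tokens)
payload_size = raw_payload_bytes(field_text)
```

A text payload like this is easy to inspect but wasteful; packing the four
coordinates into a fixed-width binary payload would need a custom encoder,
which is part of the extra Solr customization mentioned above.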

On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. 
wrote:

>
>
>
> > Two things: Here's a sample bit of SolrJ code, pulling out the DB
> > stuff should be straightforward:
> > http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> +1
>
> > We tend to prefer running Tika externally as it's entirely possible 
> > that Tika will crash or hang with certain files - and that will 
> > bring down Solr if you're running Tika within it.
>
> +1
>
> >> I want to make a small modification to Tika to get and save 
> >> additional data from my PDFs
> What info do you need, and if it is common enough, could you ask over 
> on Tika's JIRA and we'll try to add it directly?
>
>
>
>