Hi Timothy,

As I understand it, Tika is integrated with Solr.  All my indexed
documents declare that they've been parsed by Tika.  For the eml files
it's org.apache.tika.parser.mail.RFC822Parser; Word docs show they
were parsed by org.apache.tika.parser.microsoft.ooxml.OOXMLParser; PDF
files show org.apache.tika.parser.pdf.PDFParser.

What do you mean by improving the output with "tika-eval"?  I confess I
don't completely understand how documents should be prepared for
indexing.  With the eml docs, Solr/Tika seems to properly pull out
things like date, subject, to, and from.  For other (so-called 'rich
text') documents, like PDFs and Word files, the metadata is not so
useful; but on the other hand, there's not much consistent structure to
the documents I have to deal with.
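To illustrate the header fields involved, here is a quick sketch using Python's stdlib email parser (purely illustrative; in Solr it is Tika's RFC822Parser that does this extraction during indexing, and the sample message below is made up):

```python
# Sketch: show the header fields (Date, Subject, To, From) that an .eml
# file carries.  Uses Python's stdlib parser for illustration only --
# the addresses and subject are invented placeholders.
from email import message_from_string

raw = """\
From: alice@example.org
To: bob@example.org
Subject: Discovery documents, batch 3
Date: Mon, 16 Apr 2018 19:48:00 -0400

Body text here.
"""

msg = message_from_string(raw)
for field in ("Date", "Subject", "To", "From"):
    print(field, "->", msg[field])
```

These are the same headers that end up as metadata fields on the indexed documents.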

I may be missing something - am I?

Regards,

Terry


On 04/17/2018 09:38 AM, Allison, Timothy B. wrote:
> +1 to Charlie's guidance.
>
> And...
>
>> 60,000 documents, mostly pdfs and emails.
>> However, there's a premium on precision (and recall) in searches.
> Please, oh, please, no matter what you're using for content/text extraction 
> and/or OCR, run tika-eval[1] on the output to ensure that you are 
> getting mostly language-y content out of your documents.  Ping us on the Tika 
> user's list if you have any questions.
>
> Bad text, bad search. 😊
>
> [1] https://wiki.apache.org/tika/TikaEval
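A typical tika-eval profiling run looks roughly like this (the jar name, version, and paths below are illustrative assumptions; check the wiki page above for the current invocation):

```shell
# Sketch: profile a directory of extracted text with tika-eval.
# "tika-eval-1.18.jar" and the directory names are placeholders --
# verify the jar name and options against the TikaEval wiki page.
java -jar tika-eval-1.18.jar Profile -extracts ./extracts -db ./profile_db
```

The resulting database holds per-file statistics (language identification, common-token percentages, and the like) that help flag files whose extracted text is mostly junk rather than "language-y" content.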
>
> -----Original Message-----
> From: Charlie Hull [mailto:char...@flax.co.uk] 
> Sent: Tuesday, April 17, 2018 4:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Specialized Solr Application
>
> On 16/04/2018 19:48, Terry Steichen wrote:
>> I have from time-to-time posted questions to this list (and received 
>> very prompt and helpful responses).  But it seems that many of you are 
>> operating in a very different space from me.  The problems (and
>> lessons-learned) which I encounter are often very different from those 
>> that are reflected in exchanges with most other participants.
> Hi Terry,
>
> Sounds like a fascinating use case. We have some similar clients - small 
> scale law firms and publishers - who have taken advantage of Solr.
>
> One thing I would encourage you to do is to blog and/or talk about what 
> you've built. Lucene Revolution is worth applying to talk at and if you do 
> manage to get accepted - or if you go anyway - you'll meet lots of others 
> with similar challenges and come away with a huge amount of useful 
> information and contacts. Otherwise there are lots of smaller Meetup events 
> (we run the London, UK one).
>
> Don't assume just because some people here are describing their 350 billion 
> document learning-to-rank clustered monster that the small applications don't 
> matter - they really do, and the fact that they're possible to build at all 
> is a testament to the open source model and how we share information and tips.
>
> Cheers
>
> Charlie
>> So I thought it would be useful to describe what I'm about, and see if 
>> there are others out there with similar implementations (or interest 
>> in moving in that direction).  A sort of pay-forward.
>>
>> We (the Lakota Peoples Law Office) are a small public interest, pro 
>> bono law firm actively engaged in defending Native American North 
>> Dakota Water Protector clients against (ridiculously excessive) criminal 
>> charges.
>>
>> I have a small Solr (6.6.0) implementation - just one shard.  I'm 
>> using the cloud mode mainly to be able to implement access controls.  
>> The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 
>> 8GB of RAM and 4 CPU cores.  We presently have 8 collections with 
>> a total of about 60,000 documents, mostly pdfs and emails.  The 
>> indexed documents are partly our own files and partly those we obtain 
>> through legal discovery (which, surprisingly, is allowed in ND for 
>> criminal cases).  We only have a few users (our lawyers and a couple 
>> of researchers mostly), so traffic is minimal.  However, there's a 
>> premium on precision (and recall) in searches.
>>
>> The document repository is local to the server.  I piggyback on the 
>> embedded Jetty httpd in order to serve files (selected from the 
>> hitlists).  I just use a symbolic link to tie the repository to 
>> Solr/Jetty's "webapp" subdirectory.
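The symlink arrangement described above can be sketched like this (the paths are illustrative stand-ins, not the actual Solr install layout):

```shell
# Create a stand-in document repository and a stand-in webapp directory,
# then link the repository into the webapp so Jetty can serve the files.
mkdir -p /tmp/doc-repo /tmp/jetty-webapp
ln -sfn /tmp/doc-repo /tmp/jetty-webapp/documents
readlink /tmp/jetty-webapp/documents   # shows the link target
```

With the link in place, files in the repository become reachable through Jetty at the corresponding webapp path.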
>>
>> We provide remote access via ssh with port forwarding.  It provides 
>> very snappy performance, with fully encrypted links.  Appears quite stable.
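The remote-access setup described here is plain SSH local port forwarding; a typical invocation (hostname, account, and port are illustrative placeholders) looks like:

```shell
# Forward local port 8983 to the Solr server's port 8983 over SSH.
# "user@solr-host" is a placeholder for the actual account and machine.
# -N: no remote command, just the tunnel.
ssh -N -L 8983:localhost:8983 user@solr-host
# Then browse http://localhost:8983/solr/ on the client machine;
# all traffic travels inside the encrypted SSH session.
```
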
>>
>> I've had some bizarre behavior apparently caused by an interaction 
>> between repository permissions, solr permissions and the ssh link.  It 
>> seems "solved" for the moment, but time will tell for how long.
>>
>> If there are any folks out there who have similar requirements, I'd be 
>> more than happy to share the insights I've gained and problems I've 
>> encountered and (I think) overcome.  There are so many unique parts of 
>> this small scale, specialized application (many dimensions of which 
>> are not strictly internal to Solr) that it probably wouldn't be 
>> appreciated if I dumped them all on this (excellent) Solr list.  So, if you 
>> encounter problems peculiar to this kind of setup, we can perhaps help 
>> handle them off-list (although if they have more general Solr 
>> application, we should, of course, post them to the list).
>>
>> Terry Steichen
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
