Hi Timothy,

As I understand it, Tika is integrated with Solr. All my indexed documents declare that they've been parsed by Tika. For the eml files it's org.apache.tika.parser.mail.RFC822Parser. Word docs show they were parsed by org.apache.tika.parser.microsoft.ooxml.OOXMLParser. PDF files show org.apache.tika.parser.pdf.PDFParser.
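[Editor's aside: the tika-eval check Tim recommends below measures whether extraction produced "language-y" text. As a toy illustration of that idea only (this is not tika-eval itself; the heuristic, threshold, and sample strings are all made up), one could score extracted text like this:]

```python
# Toy stand-in for the kind of sanity check tika-eval automates: how
# "language-y" is the text a parser produced? Heuristic is illustrative only.

def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic.
    OCR garbage and undecoded PDF structure drag this number down."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)

# Hypothetical samples: clean email text vs. raw PDF internals leaking through.
good = "From: counsel@example.org Subject: discovery schedule for next week"
bad = "%PDF-1.7 4 0 obj << /Type /Page >> 0000017345 00000 n"

print(f"good: {alpha_ratio(good):.2f}")  # high: mostly letters
print(f"bad:  {alpha_ratio(bad):.2f}")   # low: mostly structural junk
```

[tika-eval reports much richer statistics (common-word counts, language ID, out-of-vocabulary rates), but the question it answers is the same: did the parser give you prose or junk?]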
What do you mean by improving the output with "tika-eval"? I confess I don't completely understand how documents should be prepared for indexing. But with the eml docs, Solr/Tika seems to properly pull out things like date, subject, to and from. For other (so-called 'rich text') documents (like PDFs and Word files), the metadata is not so useful, but on the other hand, there's not much consistent structure to the documents I have to deal with. I may be missing something - am I?

Regards,

Terry

On 04/17/2018 09:38 AM, Allison, Timothy B. wrote:
> +1 to Charlie's guidance.
>
> And...
>
>> 60,000 documents, mostly pdfs and emails.
>> However, there's a premium on precision (and recall) in searches.
>
> Please, oh, please, no matter what you're using for content/text extraction
> and/or OCR, run tika-eval[1] on the output to ensure that you are
> getting mostly language-y content out of your documents. Ping us on the Tika
> user's list if you have any questions.
>
> Bad text, bad search. 😊
>
> [1] https://wiki.apache.org/tika/TikaEval
>
> -----Original Message-----
> From: Charlie Hull [mailto:char...@flax.co.uk]
> Sent: Tuesday, April 17, 2018 4:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Specialized Solr Application
>
> On 16/04/2018 19:48, Terry Steichen wrote:
>> I have from time-to-time posted questions to this list (and received
>> very prompt and helpful responses). But it seems that many of you are
>> operating in a very different space from me. The problems (and
>> lessons-learned) which I encounter are often very different from those
>> that are reflected in exchanges with most other participants.
>
> Hi Terry,
>
> Sounds like a fascinating use case. We have some similar clients - small
> scale law firms and publishers - who have taken advantage of Solr.
>
> One thing I would encourage you to do is to blog and/or talk about what
> you've built.
> Lucene Revolution is worth applying to talk at, and if you do
> manage to get accepted - or if you go anyway - you'll meet lots of others
> with similar challenges and come away with a huge amount of useful
> information and contacts. Otherwise there are lots of smaller Meetup events
> (we run the London, UK one).
>
> Don't assume just because some people here are describing their 350 billion
> document learning-to-rank clustered monster that the small applications don't
> matter - they really do, and the fact that they're possible to build at all
> is a testament to the open source model and how we share information and tips.
>
> Cheers
>
> Charlie
>
>> So I thought it would be useful to describe what I'm about, and see if
>> there are others out there with similar implementations (or interest
>> in moving in that direction). A sort of pay-forward.
>>
>> We (the Lakota Peoples Law Office) are a small public interest, pro
>> bono law firm actively engaged in defending Native American North
>> Dakota Water Protector clients against (ridiculously excessive) criminal
>> charges.
>>
>> I have a small Solr (6.6.0) implementation - just one shard. I'm
>> using the cloud mode mainly to be able to implement access controls.
>> The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with
>> 8GB of RAM and 4 CPU cores. We presently have 8 collections with
>> a total of about 60,000 documents, mostly PDFs and emails. The
>> indexed documents are partly our own files and partly those we obtain
>> through legal discovery (which, surprisingly, is allowed in ND for
>> criminal cases). We only have a few users (our lawyers and a couple
>> of researchers mostly), so traffic is minimal. However, there's a
>> premium on precision (and recall) in searches.
>>
>> The document repository is local to the server. I piggyback on the
>> embedded Jetty httpd in order to serve files (selected from the
>> hitlists).
>> I just use a symbolic link to tie the repository to
>> Solr/Jetty's "webapp" subdirectory.
>>
>> We provide remote access via ssh with port forwarding. It gives
>> very snappy performance, with fully encrypted links, and appears quite stable.
>>
>> I've had some bizarre behavior apparently caused by an interaction
>> between repository permissions, Solr permissions and the ssh link. It
>> seems "solved" for the moment, but time will tell for how long.
>>
>> If there are any folks out there with similar requirements, I'd be
>> more than happy to share the insights I've gained and the problems I've
>> encountered and (I think) overcome. There are so many unique parts of
>> this small-scale, specialized application (many dimensions of which
>> are not strictly internal to Solr) that it probably wouldn't be
>> appreciated to dump them all on this (excellent) Solr list. So, if you
>> encounter problems peculiar to this kind of setup, we can perhaps help
>> handle them off-list (although if they have more general Solr
>> applicability, we should, of course, post them to the list).
>>
>> Terry Steichen
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
> web: www.flax.co.uk
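[Editor's aside: the symbolic-link trick Terry describes above can be sketched as follows. The paths here are placeholders standing in for the document repository and Solr/Jetty's "webapp" subdirectory, not the real Solr 6.6 layout; the sketch uses temp directories so it is self-contained:]

```shell
# Sketch of tying a document repository into the Jetty webapp tree with a
# symlink, as described in the thread. Temp dirs stand in for the real paths.
REPO=$(mktemp -d)      # stands in for the document repository
WEBAPP=$(mktemp -d)    # stands in for Solr/Jetty's "webapp" subdirectory

# A file in the repository (name is made up for the example).
echo "discovery-0001.pdf" > "$REPO/manifest.txt"

# One symlink makes the whole repository reachable under the webapp path,
# so Jetty can serve files straight from the repository to the hitlist links.
ln -s "$REPO" "$WEBAPP/files"

cat "$WEBAPP/files/manifest.txt"   # prints discovery-0001.pdf
```

[Remote access then rides an encrypted tunnel, e.g. `ssh -L 8983:localhost:8983 user@server`, so a browser pointed at localhost:8983 reaches the remote Solr without exposing it publicly. Note that file permissions must allow the Solr process to traverse the link target, which is likely the permissions interaction Terry mentions.]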