Terry:

If your process works, then it works and there's no real reason to change.

I was commingling the structure of the content with the metadata. You're
right that the content doesn't really have any useful structure. Sometimes
you can get some useful information out of the metadata, particularly
metadata that doesn't require a user action (last_modified and the like,
sometimes).
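
If you want to see what's actually in the metadata before deciding,
here's a minimal sketch with raw Tika (the class name and file handling
are just for illustration):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class MetadataPeek {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            // -1 disables the default write limit on extracted text
            BodyContentHandler handler = new BodyContentHandler(-1);
            try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
                parser.parse(is, handler, metadata, new ParseContext());
            }
            // Fields like this are set by the application rather than
            // the user, so they tend to be more reliable
            System.out.println("modified: "
                + metadata.get(TikaCoreProperties.MODIFIED));
            // Dump everything to see what a given file actually carries
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }

Running that over a sample of your corpus is a quick way to see how
(in)consistent the metadata fields really are before you build anything
on top of them.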

Whether that effort is worth it in your use case is, of course, a valid
question.

bq: On OCRs, I presume you're referring to PDFs that are images?

No, I was referring to scanned images. I once had to try to index
a document (I wouldn't lie to you) that was a scanned image of
a "family tree" where the most remote ancestor was written
vertically on the trunk, and each branch had a descendant
written at various angles. The resulting scanned image
was run through an OCR program that produced... well, let's
just say, little of value ;)

Best,
Erick

On Wed, Apr 18, 2018 at 8:10 AM, Terry Steichen <te...@net-frame.com> wrote:
> Thanks, Erick.  My understanding is that "rich text documents" (aka
> PDF and DOC) lack any internal structure (unlike JSON, XML, etc.), so
> there's not much potential in trying to get really precise in parsing
> them.  Or am I overlooking something here?
>
> And, as you say, the metadata of such documents is somewhat variable
> (some PDFs have a field and others don't), which suggests that you may
> not want the parser to be rigid.
>
> Moreover, as I noted earlier, most of the metadata fields of such
> documents seem to be of little value (since many document authors are
> not consistent in creating that information).
>
> I take your point about non-optimum Tika workload distribution - but I
> am only occasionally doing indexing so I don't think that would be a
> significant factor (for me, at least).
>
> A point of possible interest: I was recently indexing a set of about
> 13,000 documents and at one point a document caused Solr to crash.  I
> had to restart it.  I removed the offending document and restarted the
> indexing.  Eventually it happened again, so I did the same thing.
> After that the indexing completed successfully.  IOW, out of 13,000
> documents there were two that caused a crash, but once they were
> removed, the other 12,998 were parsed/indexed fine.
>
> On OCRs, I presume you're referring to PDFs that are images?  Part of
> our team uses Acrobat Pro to screen and convert such documents (which
> are very common in legal circles) so they can be searched.  Or did you
> mean something else?
>
> Thanks for the insights.  And the long answers (from you, Tim and
> Charlie).  These are helping me (and I hope others on the list) to
> better understand some of the nuances of effectively implementing
> (small-scale) Solr.
>
>
> On 04/17/2018 10:35 PM, Erick Erickson wrote:
>> Terry:
>>
>> Tika has a horrible problem to deal with and it's approaching a
>> miracle that it does so well ;)
>>
>> Let's take a PDF file. Which vendor's version? From what _decade_? Did
>> that vendor adhere
>> to the spec? Every spec has gray areas so even good-faith efforts can
>> result in some version/vendor
>> behaving slightly differently from the others.
>>
>> And what about Word vs. PDF? One might have "last_modified" and the
>> other might have
>> "last_edited" to mean the same thing. You mentioned that you're aware
>> of this; you can make
>> it more useful if you have finer-grained control over the ETL process.
>>
>> You say "As I understand it, Tika is integrated with Solr", which is
>> correct: you're talking about
>> the "Extracting Request Handler". However, that has a couple of
>> important caveats:
>>
>> 1> It does the best it can. But Tika has a _lot_ of tuning options
>> that allow you to get down-and-dirty
>> with the data you're indexing. You mentioned that precision is
>> important. You can do some interesting
>> things with extracting specific fields from specific kinds of
>> documents and making use of them. The
>> "last_modified" and "last_edited" fields above are an example.
>>
>> 2> It puts all the work on a single Solr node. So the very expensive
>> process of extracting data from the
>> semi-structured documents is all on the Solr node. If you use Tika in a
>> client-side program you can
>> parallelize the extraction and get through your indexing much more quickly.
>>
>> 3> Tika can occasionally get its knickers in a knot over some
>> particular document. That'll also bring
>> down the Solr instance.
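>>
>> For reference, driving the Extracting Request Handler from SolrJ looks
>> something like this (a bare-bones sketch; the collection URL, file name
>> and field names are made up for illustration):
>>
>>     import java.io.File;
>>
>>     import org.apache.solr.client.solrj.SolrClient;
>>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>     import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
>>     import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
>>
>>     public class ExtractExample {
>>         public static void main(String[] args) throws Exception {
>>             try (SolrClient solr = new HttpSolrClient.Builder(
>>                     "http://localhost:8983/solr/mycollection").build()) {
>>                 ContentStreamUpdateRequest req =
>>                     new ContentStreamUpdateRequest("/update/extract");
>>                 req.addFile(new File("contract.pdf"), "application/pdf");
>>                 // supply your own unique key
>>                 req.setParam("literal.id", "contract-001");
>>                 // map Tika's last_modified into a field your schema defines
>>                 req.setParam("fmap.last_modified", "last_modified_dt");
>>                 req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>                 solr.request(req);
>>             }
>>         }
>>     }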
>>
>> Here's a blog that can get you started doing client-side parsing,
>> ignore the RDBMS bits.
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
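>>
>> Stripped of the RDBMS bits, the heart of that approach is something
>> like this (again just a sketch with made-up names; no error handling
>> or multi-threading shown, but each thread would run its own parser):
>>
>>     import java.io.InputStream;
>>     import java.nio.file.Files;
>>     import java.nio.file.Path;
>>     import java.nio.file.Paths;
>>
>>     import org.apache.solr.client.solrj.SolrClient;
>>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>     import org.apache.solr.common.SolrInputDocument;
>>     import org.apache.tika.metadata.Metadata;
>>     import org.apache.tika.parser.AutoDetectParser;
>>     import org.apache.tika.parser.ParseContext;
>>     import org.apache.tika.sax.BodyContentHandler;
>>
>>     public class ClientSideIndexer {
>>         public static void main(String[] args) throws Exception {
>>             try (SolrClient solr = new HttpSolrClient.Builder(
>>                     "http://localhost:8983/solr/mycollection").build()) {
>>                 AutoDetectParser parser = new AutoDetectParser();
>>                 for (String arg : args) {
>>                     Path path = Paths.get(arg);
>>                     BodyContentHandler handler = new BodyContentHandler(-1);
>>                     Metadata metadata = new Metadata();
>>                     try (InputStream is = Files.newInputStream(path)) {
>>                         // Tika runs here, in the client, not inside Solr
>>                         parser.parse(is, handler, metadata, new ParseContext());
>>                     }
>>                     SolrInputDocument doc = new SolrInputDocument();
>>                     doc.addField("id", path.toString());
>>                     doc.addField("text", handler.toString());
>>                     solr.add(doc);
>>                 }
>>                 solr.commit();
>>             }
>>         }
>>     }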
>>
>> I'll leave Tim to talk about tika-eval ;) But the general problem is
>> that the extraction process can
>> result in garbage, lots of garbage. OCR is particularly prone to
>> nonsense. PDFs can be tricky:
>> there's this spacing parameter that, depending on its setting, can
>> render "e r i c k" as 5 separate
>> letters or as my name.
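>>
>> If you hit that, the knob lives in PDFParserConfig, passed in through
>> the ParseContext. A sketch (the setter name is from Tika 1.x as I
>> remember it, so double-check against your version; the value is just
>> illustrative):
>>
>>     import java.io.InputStream;
>>     import java.nio.file.Files;
>>     import java.nio.file.Paths;
>>
>>     import org.apache.tika.metadata.Metadata;
>>     import org.apache.tika.parser.AutoDetectParser;
>>     import org.apache.tika.parser.ParseContext;
>>     import org.apache.tika.parser.pdf.PDFParserConfig;
>>     import org.apache.tika.sax.BodyContentHandler;
>>
>>     public class PdfSpacing {
>>         public static void main(String[] args) throws Exception {
>>             PDFParserConfig pdfConfig = new PDFParserConfig();
>>             // Governs when two glyphs are treated as separate words;
>>             // tune it if "e r i c k" comes out as five tokens
>>             pdfConfig.setSpacingTolerance(0.5f);
>>             ParseContext context = new ParseContext();
>>             context.set(PDFParserConfig.class, pdfConfig);
>>
>>             AutoDetectParser parser = new AutoDetectParser();
>>             BodyContentHandler handler = new BodyContentHandler(-1);
>>             Metadata metadata = new Metadata();
>>             try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
>>                 parser.parse(is, handler, metadata, context);
>>             }
>>             System.out.println(handler.toString());
>>         }
>>     }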
>>
>> Hey, you asked! Don't complain about long answers ;)
>>
>> Best,
>> Erick
>>
>> On Tue, Apr 17, 2018 at 1:50 PM, Terry Steichen <te...@net-frame.com> wrote:
>>> Hi Timothy,
>>>
>>> As I understand it, Tika is integrated with Solr.  All my indexed
>>> documents declare that they've been parsed by Tika.  For the eml files
>>> it's org.apache.tika.parser.mail.RFC822Parser; Word docs show they were
>>> parsed by org.apache.tika.parser.microsoft.ooxml.OOXMLParser; PDF files
>>> show org.apache.tika.parser.pdf.PDFParser.
>>>
>>> What do you mean by improving the output with "tika-eval"?  I confess I
>>> don't completely understand how documents should be prepared for
>>> indexing.  But with the eml docs, Solr/Tika seems to properly pull out
>>> things like date, subject, to and from.  For other (so-called 'rich
>>> text') documents (like PDFs and Word files), the metadata is not so
>>> useful, but on the other hand, there's not much consistent structure to
>>> the documents I have to deal with.
>>>
>>> I may be missing something - am I?
>>>
>>> Regards,
>>>
>>> Terry
>>>
>>>
>>> On 04/17/2018 09:38 AM, Allison, Timothy B. wrote:
>>>> +1 to Charlie's guidance.
>>>>
>>>> And...
>>>>
>>>>> 60,000 documents, mostly pdfs and emails.
>>>>> However, there's a premium on precision (and recall) in searches.
>>>> Please, oh, please, no matter what you're using for content/text
>>>> extraction and/or OCR, run tika-eval[1] on the output to ensure that
>>>> you are getting mostly language-y content out of your documents.  Ping us
>>>> on the Tika user's list if you have any questions.
>>>>
>>>> Bad text, bad search. 😊
>>>>
>>>> [1] https://wiki.apache.org/tika/TikaEval
>>>>
>>>> -----Original Message-----
>>>> From: Charlie Hull [mailto:char...@flax.co.uk]
>>>> Sent: Tuesday, April 17, 2018 4:17 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Specialized Solr Application
>>>>
>>>> On 16/04/2018 19:48, Terry Steichen wrote:
>>>>> I have from time-to-time posted questions to this list (and received
>>>>> very prompt and helpful responses).  But it seems that many of you are
>>>>> operating in a very different space from me.  The problems (and
>>>>> lessons-learned) which I encounter are often very different from those
>>>>> that are reflected in exchanges with most other participants.
>>>> Hi Terry,
>>>>
>>>> Sounds like a fascinating use case. We have some similar clients - small 
>>>> scale law firms and publishers - who have taken advantage of Solr.
>>>>
>>>> One thing I would encourage you to do is to blog and/or talk about what 
>>>> you've built. Lucene Revolution is worth applying to talk at and if you do 
>>>> manage to get accepted - or if you go anyway - you'll meet lots of others 
>>>> with similar challenges and come away with a huge amount of useful 
>>>> information and contacts. Otherwise there are lots of smaller Meetup 
>>>> events (we run the London, UK one).
>>>>
>>>> Don't assume just because some people here are describing their 350 
>>>> billion document learning-to-rank clustered monster that the small 
>>>> applications don't matter - they really do, and the fact that they're 
>>>> possible to build at all is a testament to the open source model and how 
>>>> we share information and tips.
>>>>
>>>> Cheers
>>>>
>>>> Charlie
>>>>> So I thought it would be useful to describe what I'm about, and see if
>>>>> there are others out there with similar implementations (or interest
>>>>> in moving in that direction).  A sort of pay-forward.
>>>>>
>>>>> We (the Lakota Peoples Law Office) are a small public interest, pro
>>>>> bono law firm actively engaged in defending Native American North
>>>>> Dakota Water Protector clients against (ridiculously excessive) criminal 
>>>>> charges.
>>>>>
>>>>> I have a small Solr (6.6.0) implementation - just one shard.  I'm
>>>>> using the cloud mode mainly to be able to implement access controls.
>>>>> The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with
>>>>> 8GB of RAM and 4 CPU cores.  We presently have 8 collections with
>>>>> a total of about 60,000 documents, mostly pdfs and emails.  The
>>>>> indexed documents are partly our own files and partly those we obtain
>>>>> through legal discovery (which, surprisingly, is allowed in ND for
>>>>> criminal cases).  We only have a few users (our lawyers and a couple
>>>>> of researchers mostly), so traffic is minimal.  However, there's a
>>>>> premium on precision (and recall) in searches.
>>>>>
>>>>> The document repository is local to the server.  I piggyback on the
>>>>> embedded Jetty web server to serve files (selected from the
>>>>> hitlists).  I just use a symbolic link to tie the repository to
>>>>> Solr/Jetty's "webapp" subdirectory.
>>>>>
>>>>> We provide remote access via ssh with port forwarding.  It provides
>>>>> very snappy performance, with fully encrypted links.  Appears quite 
>>>>> stable.
>>>>>
>>>>> I've had some bizarre behavior apparently caused by an interaction
>>>>> between repository permissions, Solr permissions and the ssh link.  It
>>>>> seems "solved" for the moment, but time will tell for how long.
>>>>>
>>>>> If there are any folks out there who have similar requirements, I'd be
>>>>> more than happy to share the insights I've gained and the problems I've
>>>>> encountered and (I think) overcome.  There are so many unique parts of
>>>>> this small-scale, specialized application (many dimensions of which
>>>>> are not strictly internal to Solr) that it probably wouldn't be
>>>>> appropriate to dump them all on this (excellent) Solr list.  So, if you
>>>>> encounter problems peculiar to this kind of setup, we can perhaps help
>>>>> handle them off-list (although if they have more general Solr
>>>>> applicability, we should, of course, post them to the list).
>>>>>
>>>>> Terry Steichen
>>>>>
>>>> --
>>>> Charlie Hull
>>>> Flax - Open Source Enterprise Search
>>>>
>>>> tel/fax: +44 (0)8700 118334
>>>> mobile:  +44 (0)7767 825828
>>>> web: www.flax.co.uk
>>>>
>
