Re: Treatment of protected files

Karl Wright Tue, 26 Apr 2011 22:45:36 -0700

So the 500 error is occurring because Solr is throwing an exception at
indexing time, is that correct?

If this is correct, then here's my take.  (1) A 500 error is a nasty
error that Solr should not be returning under normal conditions.  (2)
A password-protected PDF is not what I would consider exceptional, so
Tika should not be throwing an exception when it sees it, merely (at
worst) logging an error and continuing.  However, having said that,
output connectors in ManifoldCF can make the decision to never retry
the document, by returning a certain status, provided the connector
can figure out that the error warrants this treatment.

My suggestion is therefore the following.  First, we should open a
ticket for Solr about this.  Second, if you can see the error output
from the Simple History for a TikaException being thrown in Solr, we
can look for that text in the response from Solr and perhaps modify
the Solr Connector to detect the case.  If you could open a ManifoldCF
ticket and include that text I'd be very grateful.
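
A minimal sketch of the detection idea, in Java. The class, enum, and
method names are hypothetical and are not part of the ManifoldCF or Solr
APIs. Matching on "TikaException" assumes that the exception name appears
in the body of the 500 response, which has not been confirmed yet.

```java
// Hypothetical helper: decides how an output connector might treat an
// ingestion response. None of these names come from ManifoldCF itself.
final class IngestionErrorClassifier {

    /** Illustrative outcomes; not ManifoldCF's actual status constants. */
    enum Outcome { ACCEPTED, REJECTED_NO_RETRY, RETRY_LATER }

    /**
     * Classifies an HTTP response from the indexing server.
     *
     * @param httpStatus   the HTTP status code returned by the server
     * @param responseBody the response text, possibly null
     */
    static Outcome classify(int httpStatus, String responseBody) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.ACCEPTED;
        }
        String body = (responseBody == null) ? "" : responseBody;
        // Assumption: a document Tika cannot parse (for example a
        // password-protected PDF) shows up as a 500 whose body mentions
        // TikaException. Retrying such a document cannot succeed.
        if (httpStatus == 500 && body.contains("TikaException")) {
            return Outcome.REJECTED_NO_RETRY;
        }
        // Anything else is treated as a possibly transient failure.
        return Outcome.RETRY_LATER;
    }
}
```

With something like this, a 500 that mentions TikaException would be
rejected once, while any other 500 would still be retried as it is today.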

Thanks!
Karl

On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
<shinichiro.ab...@gmail.com> wrote:
> Hello.
>
> There are PDF and Office files that are protected by a read password.
> We do not need to read those files if we do not know their passwords.
>
> Now, an MCF job crawls the file system repository and posts to Solr.
> Ingestion of non-protected files succeeds,
> but ingestion of a protected file fails, and the job keeps retrying it
> until the retry limit is exceeded.
> During that time, it logs a 500 result code in the Simple History.
> (Solr throws a TikaException, caused by PDFBox or Apache POI, because
> it cannot read protected documents.)
>
> When I ran that test with continuous crawling, rather than a single crawl,
> the job stopped partway and logged the following:
> Error: Repeated service interruptions - failure processing document: 
> Ingestion HTTP error code 500
> The job tried to crawl those files many times.
>
> It seems that a job spends a lot of time and resources on protected
> files.
> So I want to find a way to skip those files quickly.
>
> In my survey:
> Hop filters are not relevant (right?).
> Tika, PDFBox, and POI have mechanisms to decrypt protected files,
> but each throws a different exception when given an invalid password.
> One idea, whether or not it is feasible, is for Solr to return a
> different result code when protected files are posted.
>
> Do you have any ideas?
>
> Regards,
> Shinichiro Abe
