Concur on both points.  You can also use PDFBox's command-line app
"ExtractText" with its -startPage and -endPage parameters to extract only
the pages you want:
https://pdfbox.apache.org/1.8/commandline.html#extractText 
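
For what it's worth, here is a rough sketch of how that call might be
scripted. It only builds the ExtractText command line; the jar name and the
file names are placeholders I made up, and I'm assuming the 1.8 app jar
documented at the link above, where page numbers are 1-based.

```python
import shlex


def extract_text_command(pdfbox_jar, pdf_path, txt_path,
                         start_page=2, end_page=None):
    """Build a PDFBox ExtractText command that starts at start_page.

    start_page=2 skips the first page. end_page is optional; when it is
    omitted, ExtractText runs through the last page of the document.
    """
    cmd = ["java", "-jar", str(pdfbox_jar), "ExtractText",
           "-startPage", str(start_page)]
    if end_page is not None:
        cmd += ["-endPage", str(end_page)]
    # Input PDF first, then the text file to write.
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


if __name__ == "__main__":
    # Placeholder names: substitute your own jar and files.
    cmd = extract_text_command("pdfbox-app-1.8.x.jar", "in.pdf", "out.txt")
    print(shlex.join(cmd))
```

Passing the returned list to subprocess.run(cmd, check=True) would actually
run it; the resulting text file could then be indexed in place of the PDF.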

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, July 09, 2015 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Can I instruct the Tika Entity Processor to skip the first page 
using the DIH?

On 08/07/2015 20:39, Allison, Timothy B. wrote:
> Unfortunately, no.  We can't even do that now with straight Tika.  I
> imagine this is for pdf files?  If you'd like to add this as a
> feature, please submit a ticket over on Tika.

Another alternative is to pre-process the PDF files to remove the first 
page. I've used the command line version of PDFtk for this kind of thing 
in the past: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
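
A rough sketch of that pre-processing step, assuming pdftk is on the PATH
and using its "cat" operation with the page range 2-end. The function name
and file names are just placeholders for illustration.

```python
import shlex


def drop_first_page_command(src_pdf, dst_pdf):
    """Build a pdftk command that copies pages 2..end of src_pdf to dst_pdf."""
    # "cat 2-end" keeps everything from the second page onwards.
    return ["pdftk", str(src_pdf), "cat", "2-end", "output", str(dst_pdf)]


if __name__ == "__main__":
    # Placeholder file names: substitute your own.
    print(shlex.join(drop_first_page_command("in.pdf", "trimmed.pdf")))
```

Running the list through subprocess.run(cmd, check=True) for each file in
the directory would give a trimmed copy to index. Note that a single-page
PDF has no page 2, so such files would need handling separately.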

I'd also recommend using Tika outside Solr rather than via the DIH: 
certain nasty PDFs can kill Tika, which then can kill Solr.
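
As a minimal sketch of what "outside Solr" could look like: run the Tika
app jar in its own child process with a timeout, so a PDF that hangs or
crashes the parser only loses that one file. The jar name, the timeout value
and the helper names are my assumptions; --text is the tika-app option for
plain-text output.

```python
import subprocess
import sys


def tika_text_command(tika_jar, pdf_path):
    """Build a tika-app command that prints the extracted text to stdout."""
    return ["java", "-jar", str(tika_jar), "--text", str(pdf_path)]


def run_isolated(cmd, timeout_s=60):
    """Run cmd in a child process; return its stdout, or None on any failure.

    A timeout, a non-zero exit code, or a missing executable all yield None,
    so the caller can log the file and move on to the next one.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in child process so the sketch runs without Java or Tika installed.
    print(run_isolated([sys.executable, "-c", "print('child ran')"]))
```

In real use you would call run_isolated(tika_text_command("tika-app.jar",
path)) per file and send the returned text to Solr with whatever client you
already use, skipping files that come back as None.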

Charlie
>
> -----Original Message-----
> From: Paden [mailto:rumsey...@gmail.com]
> Sent: Wednesday, July 08, 2015 12:14 PM
> To: solr-user@lucene.apache.org
> Subject: Can I instruct the Tika Entity Processor to skip the first page
> using the DIH?
>
> Hello, I'm using the DIH to import some files from one of my local
> directories. However, every single one of these files has the same
> first page. So I want to skip that first page in order to optimize
> search.
>
> Can this be accomplished by an instruction within the
> dataimporthandler or, if not, how could you do this?
>
>
>
> -- View this message in context:
> http://lucene.472066.n3.nabble.com/Can-I-instruct-the-Tika-Entity-Processor-to-skip-the-first-page-using-the-DIH-tp4216373.html
> Sent from the Solr - User mailing list archive at Nabble.com.


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
