Concur on both points. You can also use PDFBox's "ExtractText" command-line app with the -startPage and -endPage parameters (for example, -startPage 2 to skip the first page): https://pdfbox.apache.org/1.8/commandline.html#extractText
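If it helps, here is a rough sketch of how either route might be scripted from Python before the text goes to Solr. It only builds and runs the command lines; the jar name, the file names, and the helper function names are placeholders of mine, not anything from PDFBox or PDFtk, and it assumes java and pdftk are on the PATH.

```python
import subprocess

# Assumption: name/path of a PDFBox 1.8.x "app" jar; adjust to the jar you have.
PDFBOX_JAR = "pdfbox-app-1.8.x.jar"


def extract_text_cmd(pdf, txt, start_page=2, end_page=None, jar=PDFBOX_JAR):
    """Build an ExtractText command line that starts at start_page (1-based)."""
    cmd = ["java", "-jar", str(jar), "ExtractText", "-startPage", str(start_page)]
    if end_page is not None:
        cmd += ["-endPage", str(end_page)]
    # Input PDF first, then the text file to write
    cmd += [str(pdf), str(txt)]
    return cmd


def drop_first_page_cmd(pdf, out_pdf):
    """Build a pdftk command that keeps page 2 through the last page."""
    return ["pdftk", str(pdf), "cat", "2-end", "output", str(out_pdf)]


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# Hypothetical usage (file names are illustrative only):
#   run(extract_text_cmd("report.pdf", "report.txt"))
#   run(drop_first_page_cmd("report.pdf", "report-trimmed.pdf"))
```

Either way the first page never reaches the index, and running this in a separate process keeps a bad PDF from taking Solr down with it.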

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Thursday, July 09, 2015 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

On 08/07/2015 20:39, Allison, Timothy B. wrote:
> Unfortunately, no. We can't even do that now with straight Tika. I
> imagine this is for pdf files? If you'd like to add this as a
> feature, please submit a ticket over on Tika.

Another alternative is to pre-process the PDF files to remove the first page. I've used the command line version of PDFtk for this kind of thing in the past: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

I'd also recommend using Tika outside Solr rather than via the DIH: certain nasty PDFs can kill Tika, which then can kill Solr.

Charlie

>
> -----Original Message-----
> From: Paden [mailto:rumsey...@gmail.com]
> Sent: Wednesday, July 08, 2015 12:14 PM
> To: solr-user@lucene.apache.org
> Subject: Can I instruct the Tika Entity Processor to skip the first page using the DIH?
>
> Hello, I'm using the DIH to import some files from one of my local
> directories. However, every single one of these files has the same
> first page. So I want to skip that first page in order to optimize
> search.
>
> Can this be accomplished by an instruction within the
> dataimporthandler or, if not, how could you do this?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-I-instruct-the-Tika-Entity-Processor-to-skip-the-first-page-using-the-DIH-tp4216373.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk