RE: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-09 Thread Allison, Timothy B.
Concur on both points.  You can also use PDFBox's app ExtractText with 
-startPage and -endPage parameters: 
https://pdfbox.apache.org/1.8/commandline.html#extractText 
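
For illustration, the ExtractText suggestion above could be scripted roughly as follows. This is only a sketch: it assumes a PDFBox 1.8 or 2.x app jar (the jar path is a placeholder; later releases use a different command-line syntax) and a `java` executable on the PATH. The helper names are made up for this example. Pages are 1-based, so starting at page 2 skips the first page.

```python
import subprocess


def extract_text_command(pdfbox_jar, pdf_path, txt_path, start_page=2, end_page=None):
    """Build an ExtractText command line that begins at start_page (1-based)."""
    cmd = ["java", "-jar", str(pdfbox_jar), "ExtractText",
           "-startPage", str(start_page)]
    if end_page is not None:
        cmd += ["-endPage", str(end_page)]
    # Input PDF first, output text file second.
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdfbox_jar, pdf_path, txt_path, start_page=2, end_page=None, timeout=120):
    """Run ExtractText in a child process; raises if it fails or times out."""
    cmd = extract_text_command(pdfbox_jar, pdf_path, txt_path, start_page, end_page)
    subprocess.run(cmd, check=True, timeout=timeout)
    return txt_path
```

The resulting text file can then be indexed like any other plain-text document.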

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, July 09, 2015 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Can I instruct the Tika Entity Processor to skip the first page 
using the DIH?

On 08/07/2015 20:39, Allison, Timothy B. wrote:
 Unfortunately, no.  We can't even do that now with straight Tika.  I
 imagine this is for pdf files?  If you'd like to add this as a
 feature, please submit a ticket over on Tika.

Another alternative is to pre-process the PDF files to remove the first 
page. I've used the command line version of PDFtk for this kind of thing 
in the past: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

I'd also recommend running Tika outside Solr rather than via the DIH: 
certain nasty PDFs can crash or hang Tika, and since the DIH runs Tika 
inside Solr's JVM, that can take Solr down too.

Charlie

 -----Original Message-----
 From: Paden [mailto:rumsey...@gmail.com]
 Sent: Wednesday, July 08, 2015 12:14 PM
 To: solr-user@lucene.apache.org
 Subject: Can I instruct the Tika Entity Processor to skip the first page
 using the DIH?

 Hello, I'm using the DIH to import some files from one of my local
 directories. However, every single one of these files has the same
 first page. So I want to skip that first page in order to optimize
 search.

 Can this be accomplished by an instruction within the
 dataimporthandler or, if not, how could you do this?



 -- View this message in context:
 http://lucene.472066.n3.nabble.com/Can-I-instruct-the-Tika-Entity-Processor-to-skip-the-first-page-using-the-DIH-tp4216373.html


Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-09 Thread Charlie Hull

On 08/07/2015 20:39, Allison, Timothy B. wrote:

Unfortunately, no.  We can't even do that now with straight Tika.  I
imagine this is for pdf files?  If you'd like to add this as a
feature, please submit a ticket over on Tika.


Another alternative is to pre-process the PDF files to remove the first 
page. I've used the command line version of PDFtk for this kind of thing 
in the past: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
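
A minimal sketch of that pre-processing step, assuming `pdftk` is installed and on the PATH. The directory arguments and function names are hypothetical. The `cat 2-end` operation copies every page from the second onward into a new file, leaving the originals untouched.

```python
import subprocess
from pathlib import Path


def drop_first_page_command(src, dst):
    """Build a pdftk command that copies pages 2..end of src into dst."""
    return ["pdftk", str(src), "cat", "2-end", "output", str(dst)]


def drop_first_page_all(src_dir, dst_dir, timeout=120):
    """Run pdftk over every *.pdf in src_dir; return the files that failed."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(src_dir).glob("*.pdf")):
        cmd = drop_first_page_command(src, out_dir / src.name)
        try:
            subprocess.run(cmd, check=True, timeout=timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            # Any pdftk error lands here; one-page PDFs are expected to be
            # among them, since the range 2-end has no pages to copy.
            failed.append(src)
    return failed
```

Pointing the DIH at the output directory instead of the original one would then index the trimmed copies.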


I'd also recommend running Tika outside Solr rather than via the DIH: 
certain nasty PDFs can crash or hang Tika, and since the DIH runs Tika 
inside Solr's JVM, that can take Solr down too.
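
One way to read that advice, sketched under several assumptions: Tika runs as a separate child process via the tika-app jar (path is a placeholder), extraction is bounded by a timeout, and the text is posted to Solr's JSON update handler. The Solr URL, core name, and the `id`/`text` field names are hypothetical and depend on your schema.

```python
import json
import subprocess
import urllib.request


def tika_text(tika_jar, path, timeout=60):
    """Extract text in a child JVM; return None if Tika fails or hangs."""
    cmd = ["java", "-jar", str(tika_jar), "--text", "--encoding=UTF-8", str(path)]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout, check=True)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # A bad PDF only costs us this one child process, not the indexer.
        return None
    return result.stdout.decode("utf-8", errors="replace")


def solr_update_request(solr_url, core, docs):
    """Build a POST request carrying docs as a JSON array for /update."""
    url = f"{solr_url.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"})


def index_file(tika_jar, path, solr_url, core):
    """Extract one file and send it to Solr; returns False if extraction failed."""
    text = tika_text(tika_jar, path)
    if text is None:
        return False
    req = solr_update_request(solr_url, core, [{"id": str(path), "text": text}])
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200
```

Because extraction happens in its own process, a PDF that crashes or hangs Tika is simply skipped and can be logged for later inspection while Solr keeps running.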


Charlie


-----Original Message-----
From: Paden [mailto:rumsey...@gmail.com]
Sent: Wednesday, July 08, 2015 12:14 PM
To: solr-user@lucene.apache.org
Subject: Can I instruct the Tika Entity Processor to skip the first page
using the DIH?

Hello, I'm using the DIH to import some files from one of my local
directories. However, every single one of these files has the same
first page. So I want to skip that first page in order to optimize
search.

Can this be accomplished by an instruction within the
dataimporthandler or, if not, how could you do this?










RE: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-08 Thread Allison, Timothy B.
Unfortunately, no.  We can't even do that now with straight Tika.  I imagine 
this is for pdf files?  If you'd like to add this as a feature, please submit a 
ticket over on Tika.

-----Original Message-----
From: Paden [mailto:rumsey...@gmail.com] 
Sent: Wednesday, July 08, 2015 12:14 PM
To: solr-user@lucene.apache.org
Subject: Can I instruct the Tika Entity Processor to skip the first page using 
the DIH?

Hello, I'm using the DIH to import some files from one of my local
directories. However, every single one of these files has the same first
page. So I want to skip that first page in order to optimize search. 

Can this be accomplished by an instruction within the dataimporthandler or,
if not, how could you do this? 


