Hi Luca,

What exactly does fedoragsearch.log show when PDFBox fails? Yes, Tika uses 
PDFBox 1.6.0, but I suspect the fault lies not in PDFBox but in the pdf file 
and/or in the program that generated it. If you have an example pdf file 
from which PDFBox could not extract text, and another program (which one?) 
could, then I think you should raise it with the PDFBox community.
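
To pull the relevant lines out of fedoragsearch.log, something like the 
following Python sketch might help. It is only an illustration: the script 
name scan_log.py, the function matching_lines, and the keywords it searches 
for (pdfbox, exception, error) are assumptions, not anything GSearch defines, 
so adjust them to what your log actually contains.

```python
import re
import sys

# Assumed keywords; change them to match what your fedoragsearch.log prints.
PATTERN = re.compile(r"pdfbox|exception|error", re.IGNORECASE)


def matching_lines(lines, context=2):
    """Return (line_number, text) pairs for each line matching PATTERN,
    followed by up to `context` lines after it (e.g. a stack trace)."""
    hits = []
    remaining = 0
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            # Restart the window so the match and its trailing lines are kept.
            remaining = context + 1
        if remaining > 0:
            hits.append((number, line.rstrip("\n")))
            remaining -= 1
    return hits


if __name__ == "__main__":
    # The log location depends on the installation, so it is passed in.
    if len(sys.argv) == 2:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            for number, text in matching_lines(log):
                print(f"{number}: {text}")
    else:
        print("usage: python scan_log.py /path/to/fedoragsearch.log")
```

Run it against the log right after indexing the problem pdf; the lines it 
prints are the ones worth including in a report to the PDFBox community.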

Gert

On 15/03/2012, at 08.44, Luca Lelli wrote:

> On 13/03/2012 10.35, Gert Schmeltz Pedersen wrote:
>> 
>> Have you looked into fedoragsearch.log? What does it say, when the pdf is 
>> fetched and indexed? Besides, you should go to GSearch 2.4.1, because it has 
>> better logging for this, and you might use the Tika extraction functions.
>> 
>> Gert
> Hi Gert,
> thanks for your answer. We have seen that the problem is in PDFBox 1.6.0, 
> which fails to extract the text. As for Tika, does it use PDFBox to 
> extract text from PDF files?
> best regards,
> Luca
>> 
>> 
>> On 12/03/2012, at 14.40, Luca Lelli wrote:
>> 
>>> Hi all,
>>> we have installed GSearch 2.3, which uses the latest PDFBox version (1.6.0), 
>>> and we tried to index a set of pdf files which contain text. But the GSearch 
>>> function GetDatastreamtext returns an empty string. These PDF files really 
>>> do contain text, because we can extract it with other tools. A sample of 
>>> these PDF files is 
>>> 'http://magteca-fi.inera.it:80/fedora/e_ntc/2012/0308/17/31/mag_2825+MM294339116b1bf925ddb97c303d5c0f3f+MM294339116b1bf925ddb97c303d5c0f3f.0'
>>> Do you know anything more about a problem like this?
>>> thanks
>>> -- 
>>> Luca Lelli
>>> 
>>> _______________________________________________
>>> Fedora-commons-users mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>> 
>> 
>> 
> 
> 
> -- 
> Luca Lelli
> --------------------------
> INERA srl
> http://www.inera.it
> Via Mazzini 138
> 56125 Pisa
> Italy
> tel: +39 050 9911815
> fax: +39 050 9911830
> email: [email protected]
> --------------------------
