Re: Need advice: what pdf lib to use?

2004-10-26 Thread iouli . golovatyi
OK, but even in this case parsing the doc would not be a violation,
because what we actually need for Lucene is just a collection of terms. That
has nothing to do with printing or copying pieces of _text_.
As long as you provide a method returning just a Document (I mean a Lucene
document), the permissions specified by the author of the PDF document are
respected.







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: Need advice: what pdf lib to use?

2004-10-25 Thread iouli . golovatyi
Ben,
many thanks for your comprehensive answer. Unfortunately I cannot send
the problem PDFs because they are company property and top
secret :)

Regards,
J.




Ben Litchfield [EMAIL PROTECTED] wrote on 22.10.2004 14:40:




Please post any PDFBox issues you notice on the PDFBox SourceForge bug
list and, if possible, attach or email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox; you can
monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have used xpdf (open
source but not Java) to extract the text.
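
For anyone who wants to try the xpdf route from a Java indexer, here is a minimal sketch. It assumes the xpdf command-line tool `pdftotext` is installed and on the PATH, and Java 11 or later; the class and method names are made up for illustration and are not part of xpdf, PDFBox or Lucene.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Sketch: extract text by shelling out to xpdf's pdftotext tool. */
final class XpdfTextExtractor {

    /** Builds the command line; the trailing "-" asks pdftotext to write to stdout. */
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext and returns the extracted text, or fails with an unchecked exception. */
    static String extractText(Path pdf) {
        try {
            ProcessBuilder builder = new ProcessBuilder(buildCommand(pdf));
            // Discard diagnostics so a chatty stderr cannot block the child process.
            builder.redirectError(ProcessBuilder.Redirect.DISCARD);
            Process process = builder.start();
            String text;
            try (InputStream out = process.getInputStream()) {
                text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("pdftotext exited with code " + exitCode + " for " + pdf);
            }
            return text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pdftotext", e);
        }
    }
}
```

Calling `XpdfTextExtractor.extractText(Path.of("some.pdf"))` would return the text that can then be handed to the indexer. Starting one process per file has its own cost, so this is worth measuring before relying on it.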

For other Java solutions:
PDFTextStream (commercial) - Fastest PDF-to-Text Solution for Java
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org



On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote:

 Hello all,

 I need a piece of advice/experience.

 What PDF parser (written in Java) would you recommend?

 I have now played with PDFBox-0.6.7a and would not say I was too satisfied
 with it.

 On certain PDFs (not well formatted, but still readable with Acrobat) it
 ran into a dead loop (this I could fix in the code), and on one file it
 produced an out-of-memory error and killed the JVM :( (this problem I
 could not identify yet).
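
One generic way to keep a single bad file from stalling a whole indexing run is to parse each document on a worker thread with a time limit. The sketch below uses only `java.util.concurrent`; the class name and the `parseTask` parameter are hypothetical, and the task stands in for whatever PDF library call does the actual parsing.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: bound the time spent on one document so a bad file cannot stall a batch. */
final class GuardedParse {

    /**
     * Runs the parse task on a worker thread and waits at most timeoutMillis.
     * Returns the extracted text, or null if the task timed out, failed or was interrupted.
     */
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "pdf-parse");
            thread.setDaemon(true); // a stuck parser must not keep the JVM alive
            return thread;
        });
        Future<String> result = worker.submit(parseTask);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops parsers that react to interruption.
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // Covers exceptions and errors (including OutOfMemoryError) thrown by the task.
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A loop that never checks for interruption keeps burning CPU in its daemon thread, and an out-of-memory condition can still hurt other threads in the same JVM. Running the parser in a separate process is the more robust (and slower) variant of the same idea.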

 On top of that, the performance was not too great either: it took c. 19 h
 to index 13000 files (c. 3.5 GB).
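
To put those figures in perspective, a small sketch that converts them into rates. It assumes "3.5 GB" means decimal gigabytes; the class and method names are made up for illustration.

```java
/** Sketch: turn the figures quoted above into per-file and per-second rates. */
final class IndexingThroughput {

    /** Average wall-clock seconds spent per file. */
    static double secondsPerFile(double hours, long files) {
        return hours * 3600.0 / files;
    }

    /** Average throughput in kB/s, using decimal units (1 GB = 1,000,000 kB). */
    static double kilobytesPerSecond(double hours, double gigabytes) {
        return gigabytes * 1_000_000.0 / (hours * 3600.0);
    }
}
```

With 19 hours, 13000 files and 3.5 GB this works out to about 5.3 seconds per file and roughly 51 kB/s.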

 Regards,
 J.









Re: Need advice: what pdf lib to use?

2004-10-25 Thread iouli . golovatyi
PDFBox also stumbles, throwing a java.io.IOException with the message "You
do not have permission to extract text" when the doc is copy/print
protected.
I have now tested the commercial Snowtide product and it looks like it can
process these files as well. Performance was also not bad. Unfortunately,
the test result cannot be considered 100% conclusive, because the free
version processed just the first 8 pages. On top of that, this product costs
a fortune (as long as the company is ready to pay I don't really mind :))
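
Whatever library is used, a batch indexer can at least record such files and carry on instead of aborting. A minimal sketch follows; `TextSource` is a hypothetical stand-in for the PDF library call and is not a PDFBox or Snowtide interface.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: keep a batch run going when single documents refuse text extraction. */
final class BatchExtractor {

    /** Hypothetical stand-in for whatever library call returns the text of one file. */
    interface TextSource {
        String extract(String path) throws IOException;
    }

    /** Extracted texts keyed by path, plus a note for every file that was skipped. */
    static final class Result {
        final Map<String, String> texts = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    static Result extractAll(List<String> paths, TextSource source) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.texts.put(path, source.extract(path));
            } catch (IOException e) {
                // E.g. "You do not have permission to extract text": record it and move on.
                result.skipped.add(path + ": " + e.getMessage());
            }
        }
        return result;
    }
}
```

The `skipped` list gives a concrete count of protected or broken files to report, rather than an estimate.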

J.





Robert Newson [EMAIL PROTECTED] wrote on 24.10.2004 17:44:




On the specific problem of the dead loop, I reported an instance of 
this to Ben a week or so ago and he has fixed it in the latest 
nightlies.  I expect an official release will include this bugfix soon. 
The file in question was unreadable with any PDF software I have, but 
someone managed to create it somehow...

http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832

I've found pdfbox to be pretty good. The only time I get problems is 
with corrupted or egregiously bad PDF files.

B.







Re: Need advice: what pdf lib to use?

2004-10-25 Thread sergiu gordea
Hi Iouli,

If you don't think it is illegal, you can hack the PDFBox code to remove
the protection ...

   Sergiu


Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield

PDFBox does not 'stumble' when it gives that message; that is correct
functionality if that permission is not allowed.

If your company is willing to pay a 'fortune', why not sponsor a change to
an open source project for half a fortune?

Ben
http://www.pdfbox.org




Re: Need advice: what pdf lib to use?

2004-10-25 Thread iouli . golovatyi
Yes Ben, you are right.

This would be correct functionality from a technical perspective. But look
at it my way, through the eyes of an application programmer reporting to
the big boss that c. 30% of the docs we deal with could not be indexed
because of this stupid limitation. Neither he nor I have any influence on
the PDF owners, or any idea what made them create files with document
security set.

In short, if you could also implement this incorrect functionality that
the closed source guys did, it would be really great!

As far as sponsoring is concerned, I would be ready to hack it (or at least
to try) even for 1/3 of that fortune :)))

J.










Re: Need advice: what pdf lib to use?

2004-10-25 Thread iouli . golovatyi
As far as 






Re: Need advice: what pdf lib to use?

2004-10-25 Thread iouli . golovatyi





Ben,
As far as the dead loop problem is concerned, it looks like I experienced a
slightly different problem.
I posted it under the same tracker item.

Regards
J.




Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield

In order to write software that consumes PDF documents you must agree to a
list of conditions.  One of those conditions is that permissions specified
by the author of the PDF document are respected.
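
For readers unfamiliar with how these permissions are stored: with the standard security handler described in the PDF Reference, they are bit flags in the integer /P entry of the encryption dictionary. The sketch below decodes the four original flags; the class and constant names are made up for illustration and are not PDFBox API.

```java
/** Sketch: decode the /P permission flags of the PDF standard security handler. */
final class PdfPermissions {
    // Bit positions are 1-based in the PDF Reference, so bit 3 is the mask 1 << 2.
    static final int PRINT = 1 << 2;           // bit 3: print the document
    static final int MODIFY = 1 << 3;          // bit 4: modify the contents
    static final int COPY_OR_EXTRACT = 1 << 4; // bit 5: copy or extract text and graphics
    static final int ANNOTATE = 1 << 5;        // bit 6: add or modify annotations

    /** True if the given flag is set in the /P value, i.e. the operation is permitted. */
    static boolean allows(int p, int flag) {
        return (p & flag) != 0;
    }
}
```

For example, a /P value of -21 has the print and copy/extract bits cleared, so `PdfPermissions.allows(-21, PdfPermissions.COPY_OR_EXTRACT)` is false. A library that honours the flags refuses text extraction for such a file unless it was opened with the owner password.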

PDFBox complies with this statement; if there is software that does not,
then it is in violation of copyright law.

That being said, PDFBox is open source, so a user could make modifications
to the source code or, since it is a PDF library, use it to change the
permissions on a document.

Ben




Re: Need advice: what pdf lib to use?

2004-10-25 Thread sergiu gordea
Ben Litchfield wrote:

 In order to write software that consumes PDF documents you must agree to a
 list of conditions.  One of those conditions is that permissions specified
 by the author of the PDF document are respected.

 PDFBox complies with this statement, if there is software that does not
 then they are in violation of copyright law.

I wanted to say something like this in one of my previous emails, when I
said that anyone can modify the PDFBox code to remove the restrictions.

 That being said, PDFBox is open source so a user could make modifications
 to the source code, or as a PDF library could change permissions on a
 document.

This seems to me to be a business decision.

Iouli, if your boss tells you that PDFBox is useless because it prevents
you from getting the text out of protected PDFs, then you should tell him:
"I can fix it, but it is not legal." You can hack PDFBox, but before doing
so you should make sure that the authors allow it.

All the best,
 Sergiu



Re: Need advice: what pdf lib to use?

2004-10-25 Thread Chris Fraschetti
I recently started to work on a project which needed to parse many
documents, including PDFs, very quickly and on a large scale. PDFBox
looks like the best choice except for its obvious speed issue.
Eventually I took the time to go into the PDFBox source and rip out the
individual string tokens. In doing so I lose quality on some docs (kerning
isn't there, special chars aren't available), but you can improve on that
yourself with a little more research. Parsing some PDF documents went
from 60 seconds to strip the raw text down to 5 seconds, an exchange I
gladly make for the speed improvement.
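
To illustrate the general idea of pulling raw string tokens (this is a simplified sketch, not the code described above and not PDFBox's implementation): it assumes the page content stream has already been decompressed and uses a simple one-byte font encoding, and it only collects literal `( ... )` strings. Hex strings, octal escapes and positioning are ignored, which is exactly where kerning and special characters get lost.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: pull literal string tokens "( ... )" out of an already decoded content stream. */
final class RawStringTokens {

    static List<String> extract(String contentStream) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // nesting level of unescaped parentheses
        for (int i = 0; i < contentStream.length(); i++) {
            char c = contentStream.charAt(i);
            if (depth == 0) {
                // Outside a string: skip operators and numbers until a string opens.
                if (c == '(') {
                    depth = 1;
                    current.setLength(0);
                }
                continue;
            }
            if (c == '\\' && i + 1 < contentStream.length()) {
                // Keep the escaped character verbatim; octal and \n-style escapes are not decoded.
                current.append(contentStream.charAt(++i));
            } else if (c == '(') {
                depth++;
                current.append(c);
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    tokens.add(current.toString());
                } else {
                    current.append(c);
                }
            } else {
                current.append(c);
            }
        }
        return tokens;
    }
}
```

For a stream such as `BT (Hello) Tj [(Wor) -20 (ld)] TJ ET` this yields the tokens `Hello`, `Wor` and `ld`. Joining them back into words is left to the caller, which is part of the quality trade-off.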

It's open source; with any IDE you should be able to trace the
function calls to find where to rip out the data you need.

-Chris Fraschetti


On Mon, 25 Oct 2004 18:07:52 +0200, sergiu gordea
[EMAIL PROTECTED] wrote:
 Ben Litchfield wrote:
 
 In order to write software that consumes PDF documents you must agree to a
 list of conditions.  One of those conditions is that permissions specified
 by the author of the PDF document are respected.
 
 PDFBox complies with this statement, if there is software that does not
 then they are in violation of copyright law.
 
 
 
 I wanted to say something like this in one of my previous emails, when I
 said that  anyone can modify the code of
 PDFBox to replace the restrictions
 
 That being said, PDFBox is open source so a user could make modifications
 to the source code, or as a PDF library could change permissions on a
 document.
 
 
 This seems to me as beeing a business decision,
 
  Iouli  if your boss tels you that PDFBox is useless because it
 prevents you to get the text from protected pdfs,
 than you should say him ... I can fix it but it is not legal. You can
 hack PDFbox, but before doing this you should
 ensure that the authors let you do it.
 
  All the best,
 
   Sergiu
 
 
 
 
 Ben
 
 On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
 
 
 
 Yes Ben, You are right.
 
 This would be correct functionality from technical perspective. But look
 it my way with application programmer eyes reporting to big boss that c.
 30% of doc we cope with could not be indexed because of this stupid
 limitation. Neither he or me have any influence on pdf owners and any
 ideas about what made  them create files with documet security set.
 
 In short, if You also could implement this uncorrect functionality  the
 closed source guys did, it would be really great!
 
 As far as sponsoring is concerned I would be ready to hack (or at least to
 try) it even for 1/3 of that fortune:)))
 
 J.
 
 
 
 
 
 Ben Litchfield [EMAIL PROTECTED]
 25.10.2004 14:02
 Please respond to Lucene Users List
 
 
 To: Lucene Users List [EMAIL PROTECTED]
 cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
 Subject:Re: Need advice: what pdf lib to use?
 Category:
 
 
 
 
 PDFBox does not 'stumble' when it gives that message, that is correct
 functionality if that permission is not allowed.
 
 If your company is willing to pay a 'fortune' why not sponsor a change to
 an open source project for half a fortune.
 
 Ben
 http://www.pdfbox.org
 
 On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
 
 
 
 PDFBox also stumbles, with class java.io.IOException and the message: "You
 do not have permission to extract text", in case the doc is copy/print
 protected.
 I have now tested the snowtide commercial product and it looks like it can
 process these files as well. Performance was also not so bad. Unfortunately
 the test result cannot be considered 100% conclusive, because the free version
 processed just the first 8 pages. After all, this product costs a fortune
 (as long as the company is ready to pay I don't really mind:))
 
 J.
 
 
 
 
 
 Robert Newson [EMAIL PROTECTED]
 Sent by: news [EMAIL PROTECTED]
 24.10.2004 17:44
 Please respond to Lucene Users List
 
 
 To: [EMAIL PROTECTED]
 cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
 Subject:Re: Need advice: what pdf lib to use?
 Category:
 
 
 
 [EMAIL PROTECTED] wrote:
 
 
 Hello all,
 
 I need a piece of advice/experience.
 
 Which PDF parser (written in Java) would you recommend?
 
 I have been playing with PDFBox-0.6.7a and would not say I was too satisfied
 with it.
 
 On certain PDFs (not well formatted, but still readable with Acrobat) it
 ran into a dead loop (this I could fix in code),
 and on one file it produced an out of memory error and killed the JVM :( (this
 problem I could not identify yet).
 
 On top of that, the performance was not too great either: it took c. 19 h to
 index 13000 files (c. 3.5Gb).
 
 Regards,
 J.
 
 
 
 
 
 On the specific problem of the dead loop, I reported an instance of
 this to Ben a week or so ago and he has fixed it in the latest
 nightlies.  I expect an official release will include this bugfix soon.
 The file in question was unreadable with any PDF software I have, but
 someone managed to create

Re: Need advice: what pdf lib to use?

2004-10-24 Thread Robert Newson
[EMAIL PROTECTED] wrote:
Hello all,

I need a piece of advice/experience.

Which PDF parser (written in Java) would you recommend?

I have been playing with PDFBox-0.6.7a and would not say I was too satisfied
with it.

On certain PDFs (not well formatted, but still readable with Acrobat) it
ran into a dead loop (this I could fix in code),
and on one file it produced an out of memory error and killed the JVM :( (this
problem I could not identify yet).

On top of that, the performance was not too great either: it took c. 19 h to
index 13000 files (c. 3.5Gb).

Regards,
J.

On the specific problem of the dead loop, I reported an instance of 
this to Ben a week or so ago and he has fixed it in the latest 
nightlies.  I expect an official release will include this bugfix soon. 
The file in question was unreadable with any PDF software I have, but 
someone managed to create it somehow...

http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832

I've found pdfbox to be pretty good. The only time I get problems is 
with corrupted or egregiously bad PDF files.

B.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Need advice: what pdf lib to use?

2004-10-22 Thread iouli . golovatyi
Hello all,

I need a piece of advice/experience.

Which PDF parser (written in Java) would you recommend?

I have been playing with PDFBox-0.6.7a and would not say I was too satisfied
with it.

On certain PDFs (not well formatted, but still readable with Acrobat) it
ran into a dead loop (this I could fix in code),
and on one file it produced an out of memory error and killed the JVM :( (this
problem I could not identify yet).

On top of that, the performance was not too great either: it took c. 19 h to
index 13000 files (c. 3.5Gb).

Regards,
J.
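
The two failure modes described above (a parser that loops forever on one
file, and a parser that throws on another) can be kept from stalling a whole
indexing run by extracting each file under a timeout and recording the files
that had to be skipped. What follows is only a minimal sketch in modern Java
(8 or later). The names GuardedExtraction, TextExtractor, Result and
extractAll are hypothetical and are not part of PDFBox or any other library;
TextExtractor stands in for whatever PDF library actually does the work. The
sketch does not help with an out of memory error that kills the JVM; that
would need extraction in a separate process.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class GuardedExtraction {

    /** Hypothetical stand-in for the real PDF library call. */
    public interface TextExtractor {
        String extract(String path) throws IOException;
    }

    /** Outcome of a batch: the extracted texts and the files that were skipped. */
    public static final class Result {
        public final List<String> texts = new ArrayList<>();
        public final List<String> skipped = new ArrayList<>();
    }

    /** Extracts each file in turn, giving up on a file after timeoutMillis. */
    public static Result extractAll(List<String> paths, TextExtractor extractor,
                                    long timeoutMillis) {
        Result result = new Result();
        // Daemon threads: a worker stuck in an endless loop cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
        try {
            for (String path : paths) {
                Future<String> future = pool.submit(() -> extractor.extract(path));
                try {
                    result.texts.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (TimeoutException e) {
                    // The parser took too long: ask it to stop and move on.
                    future.cancel(true);
                    result.skipped.add(path + " (timed out)");
                } catch (ExecutionException e) {
                    // The parser threw, e.g. an IOException for a protected document.
                    result.skipped.add(path + " (" + e.getCause() + ")");
                } catch (InterruptedException e) {
                    // The batch itself was interrupted: restore the flag and stop.
                    Thread.currentThread().interrupt();
                    result.skipped.add(path + " (interrupted)");
                    break;
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return result;
    }
}
```

A caller would pass its own extractor, for example
GuardedExtraction.extractAll(paths, myExtractor, 60000), and then log
result.skipped so that problem PDFs can be reported to the library's bug list.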




Re: Need advice: what pdf lib to use?

2004-10-22 Thread Ben Litchfield

Please post any PDFBox issues you notice on the PDFBox SourceForge bug
list and, if possible, attach/email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox; you can
monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have used xpdf (open
source, but not Java) to extract the text.

For other Java solutions
PDFTextStream(commercial) - Fastest PDF-to-Text Solution for Java
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org



On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote:

 Hello all,

 I need a piece of advice/experience.

 Which PDF parser (written in Java) would you recommend?

 I have been playing with PDFBox-0.6.7a and would not say I was too satisfied
 with it.

 On certain PDFs (not well formatted, but still readable with Acrobat) it
 ran into a dead loop (this I could fix in code),
 and on one file it produced an out of memory error and killed the JVM :( (this
 problem I could not identify yet).

 On top of that, the performance was not too great either: it took c. 19 h to
 index 13000 files (c. 3.5Gb).

 Regards,
 J.



