RE: Hello, I have a question in extracting Texts from PDF file.

Kay_Lee Mon, 23 May 2016 18:40:48 -0700

Dear Mr. Tilman Hausherr, 
 
Please accept my sincere apology.
 
And thank you sincerely for your quick, excellent, and delightful answer.
 
So far I have looked only at the Stack Overflow link, but I will check all the 
links you suggested.
 
My major is not software-related but biochemistry, and I'm finalizing 
the development of my application these days.
Therefore, I must take care of everything from A to Z, a million matters... I've 
been really busy. Please kindly understand.
 
Although I haven't fully checked all of your links yet, it doesn't make sense to 
me that I would need so many DLL files just to extract text from a PDF.
(But I'm really satisfied with the quality of PDFBox.)
 
I hope you can also develop a 'nitro turbo' button as a library (.dll).


 
Again, my deepest appreciation to you.
 
All the best !
 
Truthfully yours,

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
[email protected]

 
> Subject: Re: Hello, I have a question in extracting Texts from PDF file.
> To: [email protected]
> From: [email protected]
> Date: Wed, 18 May 2016 09:11:08 +0200
> 
> Am 18.05.2016 um 04:21 schrieb Kay_Lee:
> > Hello,
> >   
> > I live in South Korea, in East Asia, and I'm using Apache PDFBox to 
> > extract text from PDF files.
> > Name: Su-Sang, Lee (English name: Kay Lee)
> > Cell Phone: +82-10-3180-7976
> > Residence: Seoul, South Korea, Asia
> > E-mail: [email protected] (or [email protected])
> >   
> > My software development environment is:
> >   
> > Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache 
> > PDFBox library as .NET binaries, available as a NuGet package).
> >   
> > I can extract text (in Korean) from PDF files, with many thanks to the 
> > Apache Foundation.
> >   
> > However, my main concern is that PDFBox takes a little longer to 
> > extract text than iTextSharp and other competitors.
> >   
> > All I need is to extract Korean text from PDF files, nothing more.
> >
> > I searched the internet, including Google and Stack Overflow, but found no 
> > specific solution, only a limited number of cases.
> >
> > 1) How can I extract text faster?
> 
> You can't. Unless you have a "turbo" or "nitro" button on the computer.
> 
> Make sure you are opening the files as files and not as streams. But I see 
> below that you already do that, i.e. your code is good.
> 
> > 2) And do I need the whole library, with more than 30 MB of files, if I only 
> > need to extract text?
> 
> Of PDFBox itself, you need pdfbox, fontbox, and logging. If files are 
> encrypted, then also Bouncy Castle. You won't need xmp and the image 
> libraries. See also here:
> https://pdfbox.apache.org/1.8/dependencies.html
> 
> > If I only need specific DLL files out of all the PDFBox DLL files, 
> > could you please let me know which ones?
> >
> > 3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, 
> > such as 1.8.12 and 2.0.1.
> 
> Indeed. However, there is no official .NET release, i.e. none of the 
> "very active developers" is currently using that one (an older release 
> is here: http://pdfbox.lehmi.de/ ). And I doubt the newer versions will be 
> faster. However, they'll extract better.
> 
> There is a guide from 2012 to create the dlls:
> https://web.archive.org/web/20120204060917/http://pdfbox.apache.org/userguide/dot_net.html
> but I don't know if it works.
> 
> See also this: http://www.squarepdf.net/pdfbox-in-net
> https://stackoverflow.com/questions/8441991/how-to-build-pdfbox-for-net
> 
> >   
> > I don't belong to any company or organization; I'm a private person 
> > developing software to be distributed and used free of charge for 5 years 
> > for the public benefit. As my major is not software-related but 
> > biochemistry, please bear with me and explain in as much detail as 
> > you can.
> 
> If you're non profit and willing to distribute the source code, you can 
> use iText, see here: http://itextpdf.com/AGPL
> 
> >
> > My simple code to extract text from a PDF file is:
> >
> > internal static string ExtractTextFromPdf(string path)
> > {
> >     PDDocument doc = null;
> >     try
> >     {
> >         doc = PDDocument.load(path);
> >         PDFTextStripper stripper = new PDFTextStripper();
> >         stripper.setSuppressDuplicateOverlappingText(false);
> >         return stripper.getText(doc);
> >     }
> >     finally
> >     {
> >         if (doc != null)
> >         {
> >             doc.close();
> >         }
> >     }
> > }
> 
> Yes that code is fine.
> 
> Tilman
> 
> >   
> > I hope for your kind and excellent support.
> >
> > Thank you so much !
> >
> > Mr. Su-Sang, Lee (Kay Lee)
> > +82-10-3180-7976
> > [email protected]
> >   
> >                                     
