Re: Problem when extracting text from a pdf file

Maruan Sahyoun Wed, 21 May 2014 02:15:13 -0700

Hi Mehmet,

Sorry, now I see your issue. It’s an encoding issue in the PDF itself. Copying & 
pasting from Adobe Reader gives the same result. I don’t think we can do 
very much about it, but I’ll look into it in more detail.
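
In the meantime, a rough way to spot such files in a batch is to check how much 
of the extracted text consists of control characters. The sketch below is only 
an illustration and not part of PDFBox: the class name ExtractionCheck, the 
method names, and the 0.3 threshold are all assumptions made for this example. 
It takes the string returned by the text stripper and flags it when too many 
characters are NULs or other non-printable characters.

```java
public class ExtractionCheck {

    // Fraction of characters that are control characters (other than tab,
    // newline, carriage return) or the Unicode replacement character U+FFFD.
    public static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int suspicious = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean layout = c == '\n' || c == '\r' || c == '\t';
            if ((Character.isISOControl(c) && !layout) || c == '\uFFFD') {
                suspicious++;
            }
        }
        return (double) suspicious / text.length();
    }

    // Assumption: 0.3 is an arbitrary cut-off; tune it on real files.
    public static boolean looksGarbled(String text) {
        return suspiciousRatio(text) > 0.3;
    }

    public static void main(String[] args) {
        // Ordinary text is not flagged.
        System.out.println(looksGarbled("Olfactory learning"));
        // Mostly NUL and control characters: flagged.
        System.out.println(looksGarbled("\u0000\u0000\u0001*\u0000\u0002"));
    }
}
```

Files flagged this way would probably need another route, such as OCR on the 
rendered pages, since the text layer itself carries no usable character codes.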


BR

Maruan Sahyoun

On 21.05.2014 at 11:06, Mehmet Ali Abdulhayoglu 
<[email protected]> wrote:

> Dear Maruan,
> 
> I have checked them again. I am sure that they are the correct ones.
> 
> The PDF from the first link has the title "Olfactory Learning-Induced 
> Increase in Spine Density Along the...Neurons". I can process this PDF.
> 
> The second one has the title "Relationship between intercepted radiation, net 
> photosynthesis, respiration, and rate of ....densities". I could not handle 
> this one.
> 
> Indeed, when I copy and paste some text from this pdf, what I get is like: 
> 
> *
> 
>  
> 
> 
> 
> When you extracted the text from the second one, did you use the Java 
> code that I sent in my first mail, or something else?
> 
> 
> Thanks for your attention.
> 
> Best regards,
> Mehmet 
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:[email protected]] 
> Sent: Wednesday 21 May 2014 8:51 AM
> To: [email protected]
> Subject: Re: Problem when extracting text from a pdf file
> 
> Dear Mehmet,
> 
> did you supply the correct PDFs? I can manually copy & paste text from both, as 
> well as extract the text from both using PDFBox.
> 
> BR
> 
> Maruan Sahyoun
> 
> On 20.05.2014 at 11:56, Mehmet Ali Abdulhayoglu 
> <[email protected]> wrote:
> 
>> Dear Maruan,
>> 
>> Thanks for your reply. Below you can find the links to the PDF 
>> files. As you suggested, I can manually copy and paste text from the first 
>> PDF (dnm1), while this is not possible for the second one (dnm2), which 
>> suggests that the latter contains no real text.
>> 
>> Are there any other ways to extract text from PDFs like dnm2?
>> 
>> dnm1.pdf:
>> http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf
>> 
>> dnm2.pdf:
>> http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf
>> 
>> Regards,
>> Mehmet
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:[email protected]] 
>> Sent: Friday 16 May 2014 10:20 AM
>> To: [email protected]
>> Subject: Re: Problem when extracting text from a pdf file
>> 
>> Hi Mehmet,
>> 
>> it could well be that text extraction works for one PDF and not for 
>> another, because the second might not contain real text: what you see on 
>> screen may simply be drawn. The attachments didn't make it through because 
>> of restrictions on the mailing list. Could you upload them to a public 
>> location so I can take a look and give an answer specific to your case?
>> 
>> BR
>> 
>> Maruan Sahyoun
>> 
>> On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu 
>> <[email protected]> wrote:
>> 
>>> Dear all,
>>> 
>>> As part of my research, I am trying to convert PDF files to text files. I 
>>> have tried both iText and PDFBox, but I encounter the same issue with each.
>>> 
>>> When I extract text from dnm1.pdf (attached), both approaches 
>>> work well. However, both fail on dnm2.pdf.
>>> 
>>> I get a text file full of NULL values. Is this normal for 
>>> differently structured PDFs, or am I missing something?
>>> 
>>> Thanks in advance.
>>> 
>>> Regards,
>>> Mehmet
>>> 
>>> 
>>> My code:
>>> 
>>> package retrievingfulltetxsfromweb;
>>> 
>>> import java.io.File;
>>> import java.io.FileInputStream;
>>> import java.io.IOException;
>>> import org.apache.pdfbox.cos.COSDocument;
>>> import org.apache.pdfbox.pdfparser.PDFParser;
>>> import org.apache.pdfbox.pdmodel.PDDocument;
>>> import org.apache.pdfbox.util.PDFTextStripper;
>>> 
>>> public class PdfBox {
>>> 
>>>     // Extract text from pages 1-5 of a PDF document and print it
>>>     public PdfBox(String fileName) {
>>>         String parsedText = null;
>>>         PDDocument pdDoc = null;
>>>         COSDocument cosDoc = null;
>>>         FileInputStream in = null;
>>>         File file = new File(fileName);
>>>         if (!file.isFile()) {
>>>             System.err.println("File " + fileName + " does not exist.");
>>>             return;
>>>         }
>>>         try {
>>>             // Parse the file once; the parser reads from the input stream
>>>             in = new FileInputStream(file);
>>>             PDFParser parser = new PDFParser(in);
>>>             parser.parse();
>>>             cosDoc = parser.getDocument();
>>>             PDFTextStripper pdfStripper = new PDFTextStripper();
>>>             pdDoc = new PDDocument(cosDoc);
>>>             pdfStripper.setStartPage(1);
>>>             pdfStripper.setEndPage(5);
>>>             parsedText = pdfStripper.getText(pdDoc);
>>>             System.out.println(parsedText);
>>>         } catch (IOException e) {
>>>             System.err.println("An exception occurred in parsing the PDF Document. "
>>>                     + e.getMessage());
>>>         } finally {
>>>             try {
>>>                 if (pdDoc != null) {
>>>                     pdDoc.close(); // also closes the underlying COSDocument
>>>                 } else if (cosDoc != null) {
>>>                     cosDoc.close();
>>>                 }
>>>                 if (in != null) {
>>>                     in.close();
>>>                 }
>>>             } catch (IOException e) {
>>>                 e.printStackTrace();
>>>             }
>>>         }
>>>     }
>>> 
>>>     public static void main(String[] args) {
>>>         new PdfBox("C:/dnm1.pdf");
>>>     }
>>> 
>>> }
>> 
>
