Hi Mehmet, sorry, now I see your issue. It's an encoding issue in the PDF. Copying and pasting in Adobe Reader gives the same result. I don't think we can do much about it, but I'll look into it in more detail.
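
A minimal sketch (an illustration only, not something PDFBox provides) of how such a file could be flagged before further processing: it measures the share of NUL and other control characters in an extracted string, like the NULL values described in the first mail of this thread. The class name `ExtractionCheck`, its methods, and the threshold are hypothetical.

```java
// Hypothetical helper, not part of PDFBox or iText.
final class ExtractionCheck {

    private ExtractionCheck() {
    }

    // Returns the fraction of code points that are either the Unicode
    // replacement character (U+FFFD) or a control character other than
    // tab, carriage return, or line feed. Empty or null input yields 0.0.
    static double garbledRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int garbled = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean layoutWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            if (cp == 0xFFFD || (Character.isISOControl(cp) && !layoutWhitespace)) {
                garbled++;
            }
        }
        return (double) garbled / total;
    }

    // The threshold is an assumption; tune it for your own documents.
    static boolean looksGarbled(String text, double threshold) {
        return garbledRatio(text) > threshold;
    }
}
```

For example, `ExtractionCheck.looksGarbled(parsedText, 0.3)` could be called on the string returned by `pdfStripper.getText(pdDoc)`; the 0.3 cutoff is a guess, and a file flagged this way would probably need OCR or manual handling rather than plain text extraction.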
BR
Maruan Sahyoun

On 21.05.2014 at 11:06, Mehmet Ali Abdulhayoglu <[email protected]> wrote:

> Dear Maruan,
>
> I have checked them again. I am sure that they are the correct ones.
>
> The pdf coming from the first link has a title of "Olfactory Learning-Induced
> Increase in Spine Density Along the...Neurons". I can process this pdf.
>
> The second one has the title "Relationship between intercepted radiation, net
> photosynthesis, respiration, and rate of ....densities". I could not handle
> this one.
>
> Indeed, when I copy and paste some text from this pdf, what I get is like:
>
> *
>
> When you extract the text from the second one, did you make use of the Java
> code that I sent in my first mail or use another one?
>
> Thanks for your attention.
>
> Best regards,
> Mehmet
>
> -----Original Message-----
> From: Maruan Sahyoun [mailto:[email protected]]
> Sent: Wednesday 21 May 2014 8:51 AM
> To: [email protected]
> Subject: Re: Problem when extracting text from a pdf file
>
> Dear Mehmet,
>
> did you supply the correct PDFs? I can manually copy & paste text from both as
> well as extract the text using PDFBox for both.
>
> BR
>
> Maruan Sahyoun
>
> On 20.05.2014 at 11:56, Mehmet Ali Abdulhayoglu
> <[email protected]> wrote:
>
>> Dear Maruan,
>>
>> Thanks for your reply. Below you can find the related links for the pdf
>> files. As you state, from the first pdf (dnm1) I can manually copy paste the
>> text while this is not possible for the second one (dnm2) which shows that
>> the latter one contains no real text.
>>
>> Are there any other ways to extract text from such pdfs like dnm2?
>>
>> dnm1.pdf:
>> http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf
>>
>> dnm2.pdf:
>> http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf
>>
>> Regards,
>> Mehmet
>>
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:[email protected]]
>> Sent: Friday 16 May 2014 10:20 AM
>> To: [email protected]
>> Subject: Re: Problem when extracting text from a pdf file
>>
>> Hi Mehmet,
>>
>> it could well be that text extraction works for one PDF and doesn't for
>> another, as the second might not contain real text and what you see on
>> screen is drawn. As the attachments didn't make it through because of
>> restrictions on the mailing list, could you upload these to a public
>> location so that I can take a look at the files and the answer can be more
>> specific for your case?
>>
>> BR
>>
>> Maruan Sahyoun
>>
>> On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu
>> <[email protected]> wrote:
>>
>>> Dear all,
>>>
>>> As part of my research, I am trying to convert pdf files to text files. I
>>> have applied both itext and pdfbox but I encounter the same issue.
>>>
>>> When I try extracting text from the dnm1.pdf file (attached) both approaches
>>> work well. However, when applying them to dnm2.pdf they fail.
>>>
>>> I retrieve a text file full of NULL values. Is this normal for such
>>> differently shaped pdfs or am I missing something else?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Mehmet
>>>
>>>
>>> My code:
>>>
>>> package retrievingfulltetxsfromweb;
>>>
>>> import java.io.File;
>>> import java.io.FileInputStream;
>>> import org.apache.pdfbox.cos.COSDocument;
>>> import org.apache.pdfbox.pdfparser.PDFParser;
>>> import org.apache.pdfbox.pdmodel.PDDocument;
>>> import org.apache.pdfbox.util.PDFTextStripper;
>>>
>>> public class PdfBox {
>>>
>>>     // Extract text from the first five pages of a PDF document and print it
>>>     public PdfBox(String fileName) {
>>>         String parsedText = null;
>>>         PDFTextStripper pdfStripper = null;
>>>         PDDocument pdDoc = null;
>>>         COSDocument cosDoc = null;
>>>         File file = new File(fileName);
>>>         if (!file.isFile()) {
>>>             System.err.println("File " + fileName + " does not exist.");
>>>             return; // nothing to parse
>>>         }
>>>         try {
>>>             PDFParser parser = new PDFParser(new FileInputStream(file));
>>>             parser.parse();
>>>             cosDoc = parser.getDocument();
>>>             pdfStripper = new PDFTextStripper();
>>>             pdDoc = new PDDocument(cosDoc);
>>>             pdfStripper.setStartPage(1);
>>>             pdfStripper.setEndPage(5);
>>>             parsedText = pdfStripper.getText(pdDoc);
>>>             System.out.println(parsedText);
>>>         } catch (Exception e) {
>>>             System.err.println("An exception occurred in parsing the PDF Document. "
>>>                     + e.getMessage());
>>>         } finally {
>>>             try {
>>>                 if (cosDoc != null)
>>>                     cosDoc.close();
>>>                 if (pdDoc != null)
>>>                     pdDoc.close();
>>>             } catch (Exception e) {
>>>                 e.printStackTrace();
>>>             }
>>>         }
>>>     }
>>>
>>>     public static void main(String args[]) {
>>>         PdfBox pdf = new PdfBox("C:/dnm1.pdf");
>>>     }
>>>
>>> }
>>
>

