Hi Christian,

For text extraction I used the PDFTextStripper from PDFBox (https://github.com/apache/pdfbox/blob/5991a69ecbcd53775f685755a399304d04accfa2/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java). It’s not perfect, but good enough. Maybe it can serve as an oracle for PDFtalk on how to do it. PDFTextStripper basically parses the PDF into a DOM of words, lines, paragraphs, and pages.

Repeating myself from an earlier thread, I would be very interested in a Pharo port of PDFtalk. What I currently need is text extraction, manipulation of annotations, and content streams for drawing rectangles.

Cheers,
Manuel

> On 3 Nov 2017, at 10:15, Christian Haider <christian.hai...@smalltalked-visuals.com> wrote:
>
> Yes, reading PDFs is fine with PDFtalk, but what you want is more: text
> extraction (there is a chapter in the spec about this).
> This feature is not yet readily available. Some groundwork has been done
> (content analysis), but full text extraction needs more work.
>
> Cheers,
> Christian
>
>> -----Original Message-----
>> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Stephane Ducasse
>> Sent: Friday, 3 November 2017 09:46
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] LiteratureResearcher - where graphs, PDFs, and BibTex happily live together
>>
>> Hi Manuel
>>
>> thanks for the details. I think that the framework of Christian Haider
>> should be able to read PDFs.
>>
>> Stef
>>
>> On Thu, Nov 2, 2017 at 8:33 PM, Manuel Leuenberger <leuenber...@inf.unibe.ch> wrote:
>>> Hi Stef,
>>>
>>> The PDF integration consists of three parts:
>>>
>>> 1. CERMINE (https://github.com/CeON/CERMINE) is fed with the PDF and
>>> outputs metadata as BibTeX and a structured XML (title, authors,
>>> affiliations, abstract, keywords, references, …). This is not perfect,
>>> but way better than any other metadata extractor I could find.
>>> 2. From the metadata I generate hyperlinks that are anchored in the
>>> PDF by a text key. pdf-linker (https://github.com/maenu/pdf-linker)
>>> then searches for the anchors in the PDF text, using heuristics,
>>> because PDF's document model is intended primarily for rendering and
>>> printing, not for processing. The hyperlinks are then inserted
>>> using the awesome Apache PDFBox (https://pdfbox.apache.org/).
>>> 3. Those hyperlinks point to a URI like
>>> “pharo://handle/clickReference.in.?args=1&args=2” to represent
>>> reference 1 in paper 2.
>>> Now comes the magic part: The OS allows
>>> you to register custom handlers for custom URI schemes like pharo://.
>>> For that I created a simple Objective-C app that handles the event and
>>> passes it over as an HTTP message to a server running in Pharo
>>> (https://github.com/maenu/PharoUriScheme). The OS will even start the
>>> application if it is not yet running.
>>>
>>> While the custom URI scheme approach is super powerful, it has
>>> critical drawbacks. Any application can request to be the receiver of
>>> a URI scheme, just as browsers are for http://. Especially on mobile
>>> devices with limited access to the OS, this opens an attack vector:
>>> malware apps can imitate original apps that use schemes
>>> like facebook:// and eavesdrop on all interactions. If an original app
>>> transmits any unencrypted secrets or user data encoded in those URIs,
>>> malware can easily intercept them without the user noticing the leak. I
>>> guess this is the reason why many PDF viewers support only the standard
>>> http:// and mailto: schemes. E.g., macOS Preview gives just an
>>> audible beep when I click on a pharo:// link, and Chrome’s viewer doesn’t
>>> even bother giving any feedback. Only Adobe Acrobat allows you to
>>> relax security settings to make them work (How could it be anyone other
>>> than Adobe, when it’s a security issue? ;).
>>>
>>> I finished basic packaging today and will continue with some READMEs
>>> and a nearly-all-in-one distribution tomorrow. I’ll keep you posted in
>>> this thread.
>>>
>>> Cheers,
>>> Manuel
>>>
>>> On 2 Nov 2017, at 18:08, Stephane Ducasse <stepharo.s...@gmail.com> wrote:
>>>
>>> Hi Manuel
>>>
>>> this is super cool :)
>>> Could you describe how you did the PDF integration?
>>> And yes please package it :)
>>> I want to try it.
>>>
>>> Stef
>>>
>>> On Wed, Nov 1, 2017 at 10:16 PM, Manuel Leuenberger <leuenber...@inf.unibe.ch> wrote:
>>>
>>> Hi everyone,
>>>
>>> Over the last few weeks I have been experimenting with my take on literature
>>> research. For me, the corpus of scientific papers forms an
>>> interconnected graph, not the plain lists and tables we keep in our
>>> bibliographies. So, here is the first prototype: it has Google
>>> Scholar integration for search, can fetch PDFs from IEEE and ACM, and
>>> extracts metadata from PDFs. All this results in hyperlinked PDFs!
>>>
>>> See a demo here: https://youtu.be/EcK3Pt_WnEw
>>> Also slides from the SCG seminar here:
>>> http://scg.unibe.ch/download/softwarecomposition/2017-10-31-Leuenberger-ILE.pdf
>>>
>>> I plan on packaging it, so that those who are interested can check it
>>> out themselves (help wanted!). Currently, it only works on macOS.
>>>
>>> What do you think of my approach? Which use cases should be added?
>>>
>>> Cheers,
>>> Manuel
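
The pharo:// links described in part 3 of the quoted PDF-integration message could be decoded on the receiving side roughly as follows. This is only a minimal sketch using Python's standard urllib.parse; the function name parse_pharo_uri and the returned tuple are hypothetical, and the actual PharoUriScheme handler and Pharo server may interpret the URI differently.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pharo_uri(uri):
    """Split a pharo:// link into (action, selector, args).

    Hypothetical helper for illustration; not part of PharoUriScheme.
    """
    parts = urlsplit(uri)
    if parts.scheme != "pharo":
        raise ValueError(f"not a pharo:// URI: {uri!r}")
    # The authority component right after "pharo://", e.g. "handle".
    action = parts.netloc
    # The path is "/clickReference.in."; drop the leading slash.
    selector = parts.path.lstrip("/")
    # Repeated "args" keys are collected, in order, into one list of strings.
    args = parse_qs(parts.query).get("args", [])
    return action, selector, args


if __name__ == "__main__":
    print(parse_pharo_uri("pharo://handle/clickReference.in.?args=1&args=2"))
```

For the example link from the thread, the two args values would then stand for the reference and the paper, in that order.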