Hi Christian,

For text extraction I used the PDFTextStripper from PDFBox 
(https://github.com/apache/pdfbox/blob/5991a69ecbcd53775f685755a399304d04accfa2/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java).
It’s not perfect, but it is good enough. Maybe it can serve as an oracle for 
PDFtalk on how to do it. It basically parses the PDF into a DOM of words, 
lines, paragraphs, and pages.
Repeating myself from an earlier thread, I would be very interested in a Pharo 
port of PDFtalk. What I currently need is text extraction, manipulation of 
annotations, and content streams for drawing rectangles.
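
To illustrate the "DOM of words, lines, paragraphs, and pages" idea, here is a minimal sketch in Python of one step: grouping positioned words into lines by their baseline. This is not PDFBox's actual algorithm; the `Word` class, the `tolerance` parameter, and both function names are hypothetical and only show the general shape of such a heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A word with its position on the page (hypothetical structure)."""
    text: str
    x: float  # left edge of the word
    y: float  # baseline position, growing down the page


def group_into_lines(words: List[Word], tolerance: float = 2.0) -> List[List[Word]]:
    """Group words whose baselines lie within `tolerance` of a line's first word."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Restore left-to-right reading order inside each line.
    for line in lines:
        line.sort(key=lambda w: w.x)
    return lines


def lines_as_text(words: List[Word]) -> List[str]:
    """Join the words of each detected line with single spaces."""
    return [" ".join(w.text for w in line) for line in group_into_lines(words)]
```

Paragraphs and pages could be built the same way on top of the lines, for example by looking at the vertical gaps between consecutive lines.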

Cheers,
Manuel

> On 3 Nov 2017, at 10:15, Christian Haider 
> <christian.hai...@smalltalked-visuals.com> wrote:
> 
> Yes, reading PDFs is fine with PDFtalk, but what you want is more: Text 
> extraction (there is a chapter in the spec about this). 
> This feature is not yet readily available. Some ground work has been done 
> (content analysis), but for full text extraction more work is needed.
> 
> Cheers,
>       Christian
> 
>> -----Original Message-----
>> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf
>> Of Stephane Ducasse
>> Sent: Friday, 3 November 2017 09:46
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] LiteratureResearcher - where graphs, PDFs, and
>> BibTex happily live together
>> 
>> Hi manuel
>> 
>> thanks for the details. I think that the framework of Christian Haider
>> should be able to read PDFs.
>> 
>> Stef
>> 
>> On Thu, Nov 2, 2017 at 8:33 PM, Manuel Leuenberger
>> <leuenber...@inf.unibe.ch> wrote:
>>> Hi Stef,
>>> 
>>> The PDF integration consists of three parts:
>>> 
>>> 1. CERMINE (https://github.com/CeON/CERMINE) is fed with the PDF and
>>> outputs metadata as BibTex and a structured XML (title, authors,
>>> affiliations, abstract, keyword, references, …). This is not perfect,
>>> but way better than any other metadata extractor I could find.
>>> 2. From the metadata I generate hyperlinks that are anchored in the
>>> PDF by a text key. pdf-linker (https://github.com/maenu/pdf-linker)
>>> then searches for the anchors in the PDF text, using heuristics, as
>>> PDF has a document model that is primarily intended for rendering and
>>> printing, but not for processing. The hyperlinks are then inserted
>>> using the awesome Apache PDFBox (https://pdfbox.apache.org/).
>>> 3. Those hyperlinks point to a URI like
>>> “pharo://handle/clickReference.in.?args=1&args=2” to represent a
>>> reference 1 in the paper 2. Now comes the magic part: The OS allows
>>> you to register custom handlers for custom URI schemes like pharo://.
>>> For that I created a simple Objective-C app that handles the event and
>>> passes it over as an HTTP message to a server running in Pharo
>>> (https://github.com/maenu/PharoUriScheme). The OS will even start the
>>> application if it is not yet running.
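
Steps 2 and 3 above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not the code of pdf-linker or PharoUriScheme: the function names are hypothetical, the whitespace-insensitive search stands in for whatever heuristics pdf-linker really uses, and treating the path part "clickReference.in." as a selector name is my reading of the example URI.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit


def find_anchor(page_text: str, key: str) -> int:
    """Return the index of `key` in `page_text`, ignoring case and
    differences in whitespace (e.g. line breaks in extracted PDF text).

    Returns -1 if the key is not found.
    """
    tokens = [re.escape(token) for token in key.split()]
    if not tokens:
        return -1
    match = re.search(r"\s+".join(tokens), page_text, flags=re.IGNORECASE)
    return match.start() if match else -1


def build_reference_uri(reference: int, paper: int) -> str:
    """Build a URI of the shape shown in the mail for a reference in a paper."""
    query = urlencode([("args", reference), ("args", paper)])
    return "pharo://handle/clickReference.in.?" + query


def parse_reference_uri(uri: str):
    """Split a pharo:// URI into (host, selector, integer arguments)."""
    parts = urlsplit(uri)
    if parts.scheme != "pharo":
        raise ValueError("not a pharo:// URI: " + uri)
    selector = parts.path.lstrip("/")  # assumed to name the action to perform
    args = [int(value) for value in parse_qs(parts.query).get("args", [])]
    return parts.netloc, selector, args
```

A handler that receives the click only needs the result of `parse_reference_uri` to decide which reference in which paper was meant; the repeated `args` parameter keeps the argument order.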
>>> 
>>> While the custom URI scheme approach is super powerful, it has
>>> critical drawbacks. Any application can request to be the receiver of
>>> a URI scheme, just as browsers are for http://. Especially on mobile
>>> devices with limited access to the OS, this opens up an attack point
>>> for malware apps that replicate original apps that make use of schemes
>>> like facebook:// and eavesdrop on all interactions. If an original app
>>> transmits any unencrypted secrets or user data encoded in those URIs,
>>> malware can easily intercept them without the user noticing the leak. I
>>> guess this is the reason why many PDF viewers support only the standard
>>> http:// and mailto: schemes. E.g., macOS Preview gives just an
>>> audible beep when I click on a pharo:// link, and Chrome’s viewer doesn’t
>>> even bother giving any feedback. Only Adobe Acrobat allows you to
>>> relax security settings to make them work (How could it be someone else
>>> than Adobe, when it’s a security issue? ;).
>>> 
>>> I finished basic packaging today and will continue with some READMEs
>>> and a nearly-all-in-one distribution tomorrow, I’ll keep you posted in
>>> this thread.
>>> 
>>> Cheers,
>>> Manuel
>>> 
>>> On 2 Nov 2017, at 18:08, Stephane Ducasse <stepharo.s...@gmail.com> wrote:
>>> 
>>> Hi manuel
>>> 
>>> this is super cool :)
>>> Could you describe how you did the pdf integration?
>>> And yes please package it :)
>>> I want to try it.
>>> 
>>> Stef
>>> 
>>> On Wed, Nov 1, 2017 at 10:16 PM, Manuel Leuenberger
>>> <leuenber...@inf.unibe.ch> wrote:
>>> 
>>> Hi everyone,
>>> 
>>> I was experimenting in the last few weeks with my take on literature
>>> research. For me, the corpus of scientific papers forms an
>>> interconnected graph, not those plain lists and tables we keep in our
>>> bibliographies. So, here is the first prototype that has Google
>>> Scholar integration for search, can fetch PDFs from IEEE and ACM,
>>> extracts metadata from PDFs - all this results in hyperlinked PDFs!
>>> 
>>> See a demo here: https://youtu.be/EcK3Pt_WnEw Also slides from the SCG
>>> seminar here:
>>> http://scg.unibe.ch/download/softwarecomposition/2017-10-31-Leuenberger-ILE.pdf
>>> 
>>> I plan on packaging it, so that those who are interested can check it
>>> out themselves (help wanted!). Currently, it only works on macOS.
>>> 
>>> What do you think of my approach? Which use cases should be added?
>>> 
>>> Cheers,
>>> Manuel
>>> 
>>> 
>>> 
> 
> 
> 
