Re: Review Request 114632: Improve pdf title extraction

Albert Astals Cid Mon, 06 Jan 2014 12:44:57 -0800


> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
> 
> Luis Silva wrote:
>     What do you mean? It all works fine here.
> 
> Christoph Feck wrote:
>     Yes, because the compiler does not read comments.
> 
> Thomas Lübking wrote:
>     Aside from this, the approach seems too naive?
>     DOIs have a defined structure, a leading "doi: 10" (ignoring case and 
> making the colon and whitespace optional), and in general the "problematic" 
> tokens will have a massive digit overhead - so this could be used as an 
> additional test ( < 25 && looksLikeIndex())
> 
> Luis Silva wrote:
>     @Christoph: I just (finally) understood what you meant by "breaking the 
> comment". I uploaded a new patch that (hopefully) fixes the issue in the 
> correct way.
>     @Thomas: The approach was meant to be naive. In this simple form, the 
> patch takes care of all index-like cases as well as most other short garbage 
> titles without further parsing. What would be the point of knowing whether 
> a very short title was actually a DOI or an index?
> 
> Thomas Lübking wrote:
>     echo "The Lord of the Rings" | wc -m
>     22
>     
>     And that's not a short title - not to mention the typical Stephen King 
> ("It") or other languages that use hanzi, kanji or hanja and will never meet 
> your arbitrary 25-glyph requirement.
>     Though many academic papers (in western cultures at least) do in fact 
> have clumsy long titles, that doesn't hold for other document types.
>     
>     OTOH, if the "title" (=index) is some (md5, sha*) hash of the text, that 
> will easily exceed 25 glyphs.
>     
>     So the more honest solution seems to be to just omit the title field 
> altogether.
>     
>     The alternative (don't know how expensive the document scan is) would be 
> to check whether the title field looks like reasonable text, which could 
> involve the digit ratio, the longest non-digit sequence ("0x12a21f56ea5") and 
> maybe whether there's any digitless word at all.
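
For illustration only, the checks suggested above (a DOI prefix test, the digit 
ratio, the longest non-digit run and the presence of a digitless word) could be 
sketched roughly as below. This is not the code from the patch: every name here 
(looksLikeDoi, analyseTitle, looksLikeIndex, TitleStats) and the 0.5 ratio 
threshold are assumptions made up for the sketch, and it only looks at ASCII 
digits in a byte string rather than using Qt types.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: true if the title starts with "doi" (any case),
// optionally followed by a colon and/or whitespace, and then "10".
bool looksLikeDoi(const std::string& title) {
    const std::string prefix = "doi";
    if (title.size() < prefix.size()) return false;
    std::size_t i = 0;
    for (; i < prefix.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(title[i])) != prefix[i]) return false;
    }
    if (i < title.size() && title[i] == ':') ++i;
    while (i < title.size() && std::isspace(static_cast<unsigned char>(title[i]))) ++i;
    return title.compare(i, 2, "10") == 0;
}

// Hypothetical container for the three measurements mentioned in the thread.
struct TitleStats {
    double digitRatio;              // ASCII digits / total bytes
    std::size_t longestNonDigitRun; // longest run of consecutive non-digit bytes
    bool hasDigitlessWord;          // any whitespace-separated word without a digit
};

TitleStats analyseTitle(const std::string& title) {
    TitleStats s{0.0, 0, false};
    std::size_t digits = 0, run = 0;
    for (unsigned char c : title) {
        if (std::isdigit(c)) {
            ++digits;
            run = 0;
        } else {
            ++run;
            s.longestNonDigitRun = std::max(s.longestNonDigitRun, run);
        }
    }
    if (!title.empty()) s.digitRatio = static_cast<double>(digits) / title.size();

    std::istringstream words(title);
    std::string word;
    while (words >> word) {
        const bool digitless = std::none_of(word.begin(), word.end(),
            [](unsigned char c) { return std::isdigit(c) != 0; });
        if (digitless) { s.hasDigitlessWord = true; break; }
    }
    return s;
}

// Assumed decision rule: flag DOIs, titles with no digitless word, and titles
// that are mostly digits. longestNonDigitRun is reported but not used here,
// since a very short real title ("It") has a short run too.
bool looksLikeIndex(const std::string& title) {
    if (looksLikeDoi(title)) return true;
    const TitleStats s = analyseTitle(title);
    return !s.hasDigitlessWord || s.digitRatio > 0.5;
}
```

With these assumed rules, "0x12a21f56ea5" and "doi:10.1000/182" are flagged 
while "The Lord of the Rings" and "It" are not. A purely numeric title such as 
"1984" is flagged as well, so any such heuristic still needs checking against 
real documents before it is allowed to override a title.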

Honestly, I don't even know why there is a rule requiring a space. Looking 
at my shelf of books I can see "Cryptonomicon", "Azogue", "Portico", 
"Hyperion", "Endymion", "1984", and then I stopped. Please don't try to 
be that clever. I can understand if you want to rule out stuff like 
"Microsoft Word - something.doc", but IMHO you're already being too broad with 
the rule of "it includes microsoft". What if I have a manual about 
"Microsoft Visual Basic"?

Honestly, omitting or mangling the title is a very bad thing to do. If you 
have something sensible to run over the 1500 test PDF files I have here, I'm 
happy to help.
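
As a rough illustration of a narrower rule than "it includes microsoft", the 
sketch below only rejects titles of the exact shape 
"Microsoft Word - something.doc". It is a sketch under assumptions, not the 
extractor's code: the function names isExportResidue and endsWith are 
hypothetical, and ".docx" is an added guess next to the ".doc" case named 
above.

```cpp
#include <string>

// Hypothetical helper: true if s ends with the given suffix.
bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical check: the title must start with "Microsoft Word - " AND end
// in a document file extension. A title that merely contains "Microsoft"
// is left alone.
bool isExportResidue(const std::string& title) {
    const std::string prefix = "Microsoft Word - ";
    if (title.compare(0, prefix.size(), prefix) != 0) return false;

    // ".doc" comes from the example in the thread; ".docx" is an assumption.
    static const char* const extensions[] = {".doc", ".docx"};
    for (const char* ext : extensions) {
        if (endsWith(title, ext)) return true;
    }
    return false;
}
```

Under this rule "Microsoft Word - something.doc" is rejected, while 
"Microsoft Visual Basic" and "Cryptonomicon" are kept as they are.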

- Albert

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------

On Jan. 6, 2014, 5:47 p.m., Luis Silva wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
> 
> (Updated Jan. 6, 2014, 5:47 p.m.)
> 
> 
> Review request for Baloo and Vishesh Handa.
> 
> 
> Repository: kfilemetadata
> 
> 
> Description
> -------
> 
> A good portion of the scientific papers in my collection had a DOI or an 
> index number in the title field. These are in general short strings, shorter 
> than the real title.
> I improve the extraction of titles from PDFs by setting a minimum length 
> below which parsing of the first page is forced.
> The cut-off length is arbitrarily set to 25 characters (three "big words").
> 
> 
> Diffs
> -----
> 
>   src/extractors/popplerextractor.cpp 
> b056581f51d10b632799586eed3cc15ac539fe80 
> 
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
> 
> 
> Testing
> -------
> 
> This improved the title extraction on my pdf collection of scientific papers 
> by quite a lot.
> 
> 
> Thanks,
> 
> Luis Silva
> 
>
