Re: Review Request 114632: Improve pdf title extraction

Thomas Lübking Mon, 06 Jan 2014 12:24:17 -0800


> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
> 
> Luis Silva wrote:
>     What do you mean? It all works fine here.
> 
> Christoph Feck wrote:
>     Yes, because the compiler does not read comments.
> 
> Thomas Lübking wrote:
>     Aside from this, the approach seems too naive?
>     DOIs have a defined structure, leading "doi: 10" (ignoring the case and 
> making colon and whitespace optional) and in general the "problematic" tokens 
> will have a massive digit overhead - so this could be used as additional test 
> ( < 25 && looksLikeIndex())
> 
> Luis Silva wrote:
>     @Christoph: Just (finally) understood what you meant with "breaking the 
> comment". I uploaded a new patch that (hopefully) fixes the issue in the 
> correct way.
>     @Thomas: The approach was meant to be naive. In this simple form, this 
> patch takes care of all index-like cases as well as most other short garbage 
> titles without further parsing. What would be the point of actually knowing 
> if a very short title was actually a doi or an index?

printf '%s' "The Lord of the Rings" | wc -m
21

And that's not a short title - not to mention a typical Stephen King title 
("It") or languages written in hanzi, kanji or hanja, which will never meet 
your arbitrary 25 glyph requirement.
Though many academic papers (in western cultures at least) do in fact have 
clumsily long titles, that doesn't hold for other document types.

OTOH, if the "title" (=index) is some (md5, sha*) hash of the text, that will 
easily exceed 25 glyphs.

So the more honest solution seems to be to just omit the title field altogether.

The alternative (don't know how expensive the document scan is) would be to 
check whether the title field looks like reasonable text, which could involve 
the digit ratio, the longest non-digit sequence ("0x12a21f56ea5") and maybe 
whether there's any digitless word at all.
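
A rough sketch of what such a check might look like, using only the standard 
library rather than Qt. The names analyzeTitle() and looksLikeText() and both 
thresholds are made up for illustration and are not part of the patch; lengths 
are counted in bytes, so a multi-byte UTF-8 glyph counts more than once.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical summary of the three signals: digit ratio, longest
// non-digit run, and whether any whitespace-separated word has no digit.
struct TitleStats {
    double digitRatio = 0.0;            // digits / non-space bytes
    std::size_t longestNonDigitRun = 0; // in bytes, broken by digits and spaces
    bool hasDigitlessWord = false;
};

TitleStats analyzeTitle(const std::string &title)
{
    TitleStats stats;
    std::size_t digits = 0, nonSpace = 0, run = 0;

    for (unsigned char c : title) {
        if (std::isspace(c)) {
            run = 0;
            continue;
        }
        ++nonSpace;
        if (std::isdigit(c)) {
            ++digits;
            run = 0;
        } else {
            ++run;
            stats.longestNonDigitRun = std::max(stats.longestNonDigitRun, run);
        }
    }
    if (nonSpace > 0)
        stats.digitRatio = static_cast<double>(digits) / static_cast<double>(nonSpace);

    // Look for at least one word that contains no digit at all.
    std::istringstream words(title);
    std::string word;
    while (words >> word) {
        const bool digitless = std::none_of(word.begin(), word.end(),
            [](unsigned char c) { return std::isdigit(c) != 0; });
        if (digitless) {
            stats.hasDigitlessWord = true;
            break;
        }
    }
    return stats;
}

// Arbitrary example thresholds (assumptions, not tuned on real documents).
bool looksLikeText(const std::string &title)
{
    const TitleStats s = analyzeTitle(title);
    return s.hasDigitlessWord && s.digitRatio < 0.5 && s.longestNonDigitRun >= 2;
}
```

With these thresholds "It" and "The Lord of the Rings" pass, while 
"0x12a21f56ea5" is rejected because it has no digitless word. A word made only 
of punctuation would also count as digitless, so this is a starting point at 
best.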

- Thomas

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------

On Jan. 6, 2014, 5:47 p.m., Luis Silva wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
> 
> (Updated Jan. 6, 2014, 5:47 p.m.)
> 
> 
> Review request for Baloo and Vishesh Handa.
> 
> 
> Repository: kfilemetadata
> 
> 
> Description
> -------
> 
> A good portion of the scientific papers in my collection had a DOI or an 
> index number in the title field. These are in general short strings, shorter 
> than the real title.
> I improve extraction of titles from PDFs by setting a minimum size below 
> which parsing of the first page is forced.
> The cut-off size is arbitrarily set to 25 characters (three "big words").
> 
> 
> Diffs
> -----
> 
>   src/extractors/popplerextractor.cpp 
> b056581f51d10b632799586eed3cc15ac539fe80 
> 
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
> 
> 
> Testing
> -------
> 
> This improved the title extraction on my pdf collection of scientific papers 
> by quite a lot.
> 
> 
> Thanks,
> 
> Luis Silva
> 
>
