> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
>
> Luis Silva wrote:
>     What do you mean? It all works fine here.
>
> Christoph Feck wrote:
>     Yes, because the compiler does not read comments.
>
> Thomas Lübking wrote:
>     Aside this, the approach seems too naive?
>     DOIs have a defined structure, leading "doi: 10" (ignoring the case and
>     making colon and whitespace optional) and in general the "problematic" tokens
>     will have a massive digit overhead - so this could be used as additional test
>     ( < 25 && looksLikeIndex())
>
> Luis Silva wrote:
>     @Christoph: Just (finally) understood what you meant with "breaking the
>     comment". I uploaded a new patch that (hopefully) fixes the issue in the
>     correct way.
>     @Thomas: The approach was meant to be naive. In this simple form, this
>     patch takes care of all index-like cases as well as most other short garbage
>     titles without further parsing. What would be the point of actually knowing
>     if a very short title was actually a doi or an index?
>
> Thomas Lübking wrote:
>     echo "The Lord of the Rings" | wc -m
>     22
>
>     And that's not a short title - not to mention the typical Stephen King
>     ("It") or other languages that use hanzi, kanji or hanja and will never met
>     your arbitrary 25 glyph requirement.
>     Though many academic papers (in western cultures at least) in fact have
>     clumsy long titles, that doesn't hold for other document types.
>
>     OTOH, if the "title" (=index) is some (md5, sha*) hash of the text, that
>     will easily outnumber 25 glyphs.
>
>     So the more honest solution seems to just omit the title field altogether.
>
>     The alternative (don't know how expensive the document scan is) would be
>     to check whether the title field seems like reasonable text, what could
>     invoke the digit ratio, the longest non-digit sequence ("0x12a21f56ea5") and
>     maybe whether there's any digitless word at all.
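
To make the last suggestion above concrete, here is a rough sketch of what such a "reasonable text" check might look like. It is written in Python only for brevity and is not the patch under review. The function names, the 0.5 digit-ratio limit and the minimum run length of 2 are all assumptions picked for illustration, and the DOI-like string in the examples is made up.

```python
import re


def title_features(title: str) -> dict:
    """Compute the three signals suggested in the thread for a candidate title."""
    # Ignore whitespace when computing the share of digits.
    visible = [c for c in title if not c.isspace()]
    digits = sum(c.isdecimal() for c in visible)
    return {
        # Fraction of non-whitespace characters that are decimal digits.
        "digit_ratio": digits / len(visible) if visible else 0.0,
        # Length of the longest stretch between digits (whitespace included).
        "longest_non_digit_run": max(len(run) for run in re.split(r"\d+", title)),
        # True if at least one whitespace-separated word contains no digit.
        "has_digitless_word": any(
            not any(c.isdecimal() for c in word) for word in title.split()
        ),
    }


def looks_like_text(title: str, max_digit_ratio: float = 0.5, min_run: int = 2) -> bool:
    """Hypothetical heuristic; both thresholds are arbitrary assumptions."""
    f = title_features(title)
    return (
        f["digit_ratio"] <= max_digit_ratio
        and f["longest_non_digit_run"] >= min_run
        and f["has_digitless_word"]
    )


if __name__ == "__main__":
    for candidate in ["The Lord of the Rings", "It", "0x12a21f56ea5",
                      "doi:10.1234/5678", "1984"]:
        print(repr(candidate), looks_like_text(candidate))
```

With these assumed thresholds, "The Lord of the Rings" and "It" are accepted, while "0x12a21f56ea5" and the DOI-like string are rejected. A real title such as "1984" is rejected too, and a hash made only of letters would slip through, so any check along these lines can still misclassify genuine titles.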

Honestly, I don't even know why there is a rule requiring a space. Looking at my shelf of books I can see "Cryptonomicon", "Azogue", "Portico", "Hyperion", "Endymion", "1984", and then I stopped.

Please don't try to be that clever. I can understand if you want to rule out stuff like "Microsoft Word - something.doc", but IMHO you're already being too broad with the rule of "it includes microsoft". What if I have a manual about "Microsoft Visual Basic"?

Honestly, omitting or mangling the title is a very bad thing to do.

If you have something sensible to run over the 1500 test PDF files I have here, I'm happy to help.


- Albert


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------


On Jan. 6, 2014, 5:47 p.m., Luis Silva wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
>
> (Updated Jan. 6, 2014, 5:47 p.m.)
>
>
> Review request for Baloo and Vishesh Handa.
>
>
> Repository: kfilemetadata
>
>
> Description
> -------
>
> A good portion of scientific papers in my collection had a doi or an index
> number in the title. These are in general short string chains, shorter than
> the real title.
> I improve extraction of titles from pdf's by setting a minimum size below
> which parsing of the first page is forced.
> The cut-off size is arbitrarily set to 25 characters (three "big words").
>
>
> Diffs
> -----
>
>   src/extractors/popplerextractor.cpp
>   b056581f51d10b632799586eed3cc15ac539fe80
>
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
>
>
> Testing
> -------
>
> This improved the title extraction on my pdf collection of scientific papers
> by quite a lot.
>
>
> Thanks,
>
> Luis Silva
>
>