Re: Review Request 114632: Improve pdf title extraction

2014-01-16 Thread Luis Silva

---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/
---

(Updated Jan. 16, 2014, 1:02 p.m.)


Status
--

This change has been discarded.


Review request for Baloo and Vishesh Handa.


Repository: kfilemetadata


Description
---

A good portion of the scientific papers in my collection have a DOI or an index 
number in the title field. These are generally short strings, shorter than the 
real title.
I improve extraction of titles from PDFs by setting a minimum length below which 
parsing of the first page is forced.
The cut-off length is arbitrarily set to 25 characters (three big words).
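
The rule described above can be sketched roughly as follows. This is only an illustration in plain C++ and not the code from the patch: the names kMinTitleLength and shouldParseFirstPage are hypothetical, and size() counts bytes, so the check is only meaningful for ASCII titles.

```cpp
#include <cstddef>
#include <string>

// Hypothetical constant: the 25-character cut-off from the description.
constexpr std::size_t kMinTitleLength = 25;

// Returns true when the metadata title is shorter than the cut-off,
// i.e. when parsing of the first page would be forced.
// Assumption: size() counts bytes, which equals characters only for ASCII.
bool shouldParseFirstPage(const std::string& metadataTitle) {
    return metadataTitle.size() < kMinTitleLength;
}
```

With this rule a title such as "doi:10.1000/182" triggers first-page parsing, but so does any legitimate title shorter than 25 characters.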


Diffs
-

  src/extractors/popplerextractor.cpp b056581f51d10b632799586eed3cc15ac539fe80 

Diff: https://git.reviewboard.kde.org/r/114632/diff/


Testing
---

This improved the title extraction on my pdf collection of scientific papers by 
quite a lot.


Thanks,

Luis Silva


 Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe 


Re: Review Request 114632: Improve pdf title extraction

2014-01-15 Thread Luis Silva


 On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
  Hm, you broke the comment :)
 
 Luis Silva wrote:
 What do you mean? It all works fine here.
 
 Christoph Feck wrote:
 Yes, because the compiler does not read comments.
 
 Thomas Lübking wrote:
 Aside from this, the approach seems too naive?
 DOIs have a defined structure, a leading "doi: 10" (ignoring case and 
 making the colon and whitespace optional), and in general the problematic tokens 
 will have a massive digit overhead - so this could be used as an additional test 
 ( < 25 && looksLikeIndex())
 
 Luis Silva wrote:
 @Christoph: Just (finally) understood what you meant with breaking the 
 comment. I uploaded a new patch that (hopefully) fixes the issue in the 
 correct way.
 @Thomas: The approach was meant to be naive. In this simple form, this 
 patch takes care of all index-like cases as well as most other short garbage 
 titles without further parsing. What would be the point of actually knowing 
 if a very short title was actually a doi or an index?
 
 Thomas Lübking wrote:
 echo The Lord of the Rings | wc -m
 22
 
 And that's not a short title - not to mention the typical Stephen King 
 (It) or other languages that use hanzi, kanji or hanja and will never meet 
 your arbitrary 25 glyph requirement.
 Though many academic papers (in western cultures at least) in fact have 
 clumsy long titles, that doesn't hold for other document types.
 
 OTOH, if the title (=index) is some (md5, sha*) hash of the text, that 
 will easily outnumber 25 glyphs.
 
 So the more honest solution seems to just omit the title field altogether.
 
 The alternative (don't know how expensive the document scan is) would be 
 to check whether the title field seems like reasonable text, which could 
 involve the digit ratio, the longest non-digit sequence (0x12a21f56ea5) and 
 maybe whether there's any digitless word at all.
 
 Albert Astals Cid wrote:
 Honestly I don't even know why there is the rule for needing a space; 
 looking at my shelf of books I can see Cryptonomicon, Azogue, Portico, 
 Hyperion, Endymion, 1984, and then I stopped. Please, don't try to 
 be that clever. I can understand if you want to rule out stuff like 
 "Microsoft Word - something.doc", but IMHO you're already being too broad 
 with the rule "it includes microsoft". What if I have a manual about 
 Microsoft Visual Basic?
 
 Honestly, omitting or mangling the title is a very bad thing to do. If you 
 have a sensible thing to run over the 1500 test PDF files I have here, I'm 
 happy to help.
 
 Christoph Feck wrote:
 Would it make sense to refactor the code to use the (PDF supplied) 
 document title, and, if for whatever reason it is believed to be wrong, 
 append the extracted text that is believed to be a better title?
 
 Luis Silva wrote:
 I can see the point Albert is making that when a pdf has a short (but 
 valid) pdftitle and an unparseable first page the resulting extracted title 
 will be gibberish. I also agree that mangling the title just because it 
 seemed to be small is unacceptable. I must admit that I did not think about 
 the cases of hanzi, kanji or hanja for which this patch would systematically 
 force the parsing of the first page of the document. 
 The issue here is when the pdftitle does not match the real document 
 title. In my database of academic papers (700+) this happens a lot. Most of 
 my other documents are either prints to PDF, documents generated from their 
 LaTeX source, or Word documents converted to PDF, most (90%) of which lack a 
 pdftitle and so have to be parsed anyway. In my experience this is a 
 typical situation, at least amongst academics. Of course, the best operating 
 solution must cater for the most common personas, not just academics, but in 
 your experience, what would that be?
 
 Albert Astals Cid wrote:
 I'm with Christoph here. Not sure what the use case for this is, but would 
 it be possible to add the extra information instead of replacing it? Maybe 
 even in a second field? Like title and thingwethinkmaybethetitle?
 
 Vishesh Handa wrote:
 The more I think about this, the more I realize how this is really not 
 required.
 
 Use Cases -
 
 1. Viewing the title - The title can currently only be seen via the 
 Dolphin sidebar
 2. Searching - It currently makes no difference if the text is in the 
 title or in the plain text. Both are currently given the same priority. In 
 the future we could give the title/any other field a higher priority, but 
 that has not been done.
 
 Given that the only real use case is (1), and it is debatable if Dolphin 
 users will actually care, perhaps we could remove this altogether. This 
 could be implemented in a specialized application like Conquirere, which is 
 built for research papers.

I agree with Vishesh. If the document text is indeed being extracted then, 
indeed, it should 

Re: Review Request 114632: Improve pdf title extraction

2014-01-08 Thread Vishesh Handa


The more I think about this, the more I realize how this is really not required.

Use Cases -

1. Viewing the title - The title can currently only be seen via the Dolphin 
sidebar
2. Searching - It currently makes no difference if the text is in the title or 
in the plain text. Both are currently given the same priority. In the future we 
could give the title/any other field a higher priority, but that has not been 
done.

Given that the only real use case is (1), and it is debatable if Dolphin users 
will actually care, perhaps we could remove this altogether. This could be 
implemented in a specialized application like Conquirere, which is built for 
research papers.


- Vishesh


---
This is an automatically generated e-mail. To reply, visit:

Re: Review Request 114632: Improve pdf title extraction

2014-01-07 Thread Albert Astals Cid


I'm with Christoph here. Not sure what the use case for this is, but would it be 
possible to add the extra information instead of replacing it? Maybe even in a 
second field? Like title and thingwethinkmaybethetitle?
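
One way to picture the "second field" idea is the sketch below. It is only an assumption about how such a result could be shaped, not an existing kfilemetadata structure: TitleInfo, guessedTitle and collectTitles are hypothetical names, and the decision that the metadata title "looks wrong" is passed in as a plain flag.

```cpp
#include <string>

// Hypothetical result type: the PDF-supplied title is never replaced.
struct TitleInfo {
    std::string title;         // title from the PDF metadata, kept unchanged
    std::string guessedTitle;  // candidate from the first page; empty if none
};

// Keeps the metadata title and stores the first-page candidate alongside it,
// but only when the metadata title is believed to be wrong and the candidate
// is non-empty and actually differs.
TitleInfo collectTitles(const std::string& metadataTitle,
                        const std::string& firstPageGuess,
                        bool metadataLooksWrong) {
    TitleInfo info;
    info.title = metadataTitle;
    if (metadataLooksWrong && !firstPageGuess.empty() &&
        firstPageGuess != metadataTitle) {
        info.guessedTitle = firstPageGuess;
    }
    return info;
}
```

A wrong guess then costs nothing: the original title is still there, and a consumer can decide which field to show.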


- Albert


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---


On Jan. 6, 2014, 5:47 p.m., Luis Silva wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://git.reviewboard.kde.org/r/114632/
 ---
 
 (Updated Jan. 6, 2014, 5:47 p.m.)
 
 
 Review request for Baloo and Vishesh Handa.
 
 
 Repository: kfilemetadata
 
 
 Description
 ---
 
 A good portion of the scientific papers in my collection have a DOI or an index 
 number in the title field. These are generally short strings, shorter than 
 the real title.
 I improve extraction of titles from 

Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Luis Silva


 On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
  Hm, you broke the comment :)

What do you mean? It all works fine here. 


- Luis


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---


On Dec. 23, 2013, 4:14 p.m., Luis Silva wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://git.reviewboard.kde.org/r/114632/
 ---
 
 (Updated Dec. 23, 2013, 4:14 p.m.)
 
 
 Review request for Baloo and Vishesh Handa.
 
 
 Repository: kfilemetadata
 
 
 Description
 ---
 
 A good portion of the scientific papers in my collection have a DOI or an index 
 number in the title field. These are generally short strings, shorter than 
 the real title.
 I improve extraction of titles from PDFs by setting a minimum length below 
 which parsing of the first page is forced.
 The cut-off length is arbitrarily set to 25 characters (three big words).
 
 
 Diffs
 -
 
   src/extractors/popplerextractor.cpp 
 b056581f51d10b632799586eed3cc15ac539fe80 
 
 Diff: https://git.reviewboard.kde.org/r/114632/diff/
 
 
 Testing
 ---
 
 This improved the title extraction on my pdf collection of scientific papers 
 by quite a lot.
 
 
 Thanks,
 
 Luis Silva
 





Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Christoph Feck


 On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
  Hm, you broke the comment :)
 
 Luis Silva wrote:
 What do you mean? It all works fine here.

Yes, because the compiler does not read comments.


- Christoph


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---




Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Luis Silva

---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/
---

(Updated Jan. 6, 2014, 5:47 p.m.)


Review request for Baloo and Vishesh Handa.


Repository: kfilemetadata


Description
---

A good portion of the scientific papers in my collection have a DOI or an index 
number in the title field. These are generally short strings, shorter than the 
real title.
I improve extraction of titles from PDFs by setting a minimum length below which 
parsing of the first page is forced.
The cut-off length is arbitrarily set to 25 characters (three big words).


Diffs (updated)
-

  src/extractors/popplerextractor.cpp b056581f51d10b632799586eed3cc15ac539fe80 

Diff: https://git.reviewboard.kde.org/r/114632/diff/


Testing
---

This improved the title extraction on my pdf collection of scientific papers by 
quite a lot.


Thanks,

Luis Silva




Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Luis Silva


@Christoph: Just (finally) understood what you meant with breaking the 
comment. I uploaded a new patch that (hopefully) fixes the issue in the 
correct way.
@Thomas: The approach was meant to be naive. In this simple form, this patch 
takes care of all index-like cases as well as most other short garbage titles 
without further parsing. What would be the point of actually knowing if a very 
short title was actually a doi or an index?


- Luis


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---




Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Thomas Lübking


echo The Lord of the Rings | wc -m
22

And that's not a short title - not to mention the typical Stephen King (It) 
or other languages that use hanzi, kanji or hanja and will never meet your 
arbitrary 25 glyph requirement.
Though many academic papers (in western cultures at least) in fact have clumsy 
long titles, that doesn't hold for other document types.

OTOH, if the title (=index) is some (md5, sha*) hash of the text, that will 
easily outnumber 25 glyphs.

So the more honest solution seems to just omit the title field altogether.

The alternative (don't know how expensive the document scan is) would be to 
check whether the title field seems like reasonable text, which could involve the 
digit ratio, the longest non-digit sequence (0x12a21f56ea5) and maybe whether 
there's any digitless word at all.
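
A rough sketch of such checks in plain C++ follows. It is an illustration under stated assumptions, not code from kfilemetadata: digitRatio, hasDigitlessWord, startsLikeDoi and looksLikeIndex are hypothetical helpers, the 0.5 ratio threshold is arbitrary, and the character tests only handle ASCII. The DOI test follows the "doi", optional colon, optional whitespace, "10" shape mentioned earlier in the thread.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Fraction of characters that are decimal digits (0.0 for an empty string).
double digitRatio(const std::string& s) {
    if (s.empty()) return 0.0;
    const auto digits = std::count_if(s.begin(), s.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
    return static_cast<double>(digits) / static_cast<double>(s.size());
}

// True if some whitespace-separated word has a letter and no digit.
bool hasDigitlessWord(const std::string& s) {
    std::istringstream in(s);
    std::string word;
    while (in >> word) {
        const bool anyDigit = std::any_of(word.begin(), word.end(),
            [](unsigned char c) { return std::isdigit(c) != 0; });
        const bool anyAlpha = std::any_of(word.begin(), word.end(),
            [](unsigned char c) { return std::isalpha(c) != 0; });
        if (anyAlpha && !anyDigit) return true;
    }
    return false;
}

// Case-insensitive "doi", optional ':', optional whitespace, then "10".
bool startsLikeDoi(const std::string& s) {
    const std::string prefix = "doi";
    if (s.size() < prefix.size()) return false;
    std::size_t i = 0;
    for (; i < prefix.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(s[i])) != prefix[i]) return false;
    }
    if (i < s.size() && s[i] == ':') ++i;
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    return s.compare(i, 2, "10") == 0;
}

// Hypothetical combination; the 0.5 threshold is an arbitrary assumption.
bool looksLikeIndex(const std::string& title) {
    return startsLikeDoi(title) || digitRatio(title) > 0.5 ||
           !hasDigitlessWord(title);
}
```

In this sketch "The Lord of the Rings" and "It" pass as text, while "doi:10.1000/182" and "0x12a21f56ea5" are flagged. A purely numeric title such as "1984" is flagged as well, so the heuristic still produces false positives.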


- Thomas


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---




Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Albert Astals Cid


Honestly I don't even know why there is the rule for needing a space; looking 
at my shelf of books I can see Cryptonomicon, Azogue, Portico, 
Hyperion, Endymion, 1984, and then I stopped. Please, don't try to 
be that clever. I can understand if you want to rule out stuff like 
"Microsoft Word - something.doc", but IMHO you're already being too broad with 
the rule "it includes microsoft". What if I have a manual about 
Microsoft Visual Basic?

Honestly, omitting or mangling the title is a very bad thing to do. If you have a 
sensible thing to run over the 1500 test PDF files I have here, I'm happy to 
help.


- Albert


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---




Re: Review Request 114632: Improve pdf title extraction

2014-01-06 Thread Christoph Feck


Would it make sense to refactor the code to use the (PDF supplied) document 
title, and, if for whatever reason it is believed to be wrong, append the 
extracted text that is believed to be a better title?


- Christoph


---
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
---

