[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562697#comment-14562697 ]

Moritz Dorka commented on TIKA-1315:

The case of none for the numberText of the current level (i.e. its nfc equals 0xFF or 0x17) isn't really well specified by Microsoft in their binary *.doc specification. Hence this 5th test. I don't know if that has changed with the new Ecma/ISO XML specs.

Basic list support in WordExtractor
---
Key: TIKA-1315
URL: https://issues.apache.org/jira/browse/TIKA-1315
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Assignee: Tim Allison
Priority: Minor
Fix For: 1.9
Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, complex_list_test.doc

Hello guys,

I am really sorry to post an issue like this, but I have no other way of contacting you and I don't quite understand how you manage forks and pull requests (I don't think you do that). Plus I don't know your coding styles and stuff.

In my project I needed Tika to parse numbered lists from Word .doc documents, but TIKA doesn't support it. So I looked for a solution and found one here: http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ . I adapted this solution to Apache TIKA with a few fixes and improvements. Anyway, feel free to use any of it so it can help people who struggle with lists in TIKA like I did.

Attached files are:
Updated test
Fixed WordExtractor
Added ListUtils

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moritz Dorka updated TIKA-1315:
---
Comment: was deleted

(was: The case of none for the numberText of the current level (i.e. its nfc equals 0xFF or 0x17) isn't really well specified by Microsoft in their binary *.doc specification. Hence, this 5th test. Don't know if that has changed with the new Ecma/ISO XML specs.)
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560750#comment-14560750 ]

Moritz Dorka commented on TIKA-1315:

bq. For test 2, how did you get 1.b.III?

It's been quite a while since I authored that file. But at first glance I suppose this happens because the restartLim is set not to the first (the ordinary case) but to the second most-significant ilvl. This means the third level will only see a reset each time an item belonging to the first level occurs. Since 1.b, which precedes the element in question, belongs to the second level, no such reset happens and the II from 1.a.II gets incremented instead. See the definition of [ilvlRestartLim|https://msdn.microsoft.com/en-us/library/dd923594%28v=office.12%29.aspx] for a more complicated explanation.

I do not know Word's XML stuff, but assuming the logic hasn't changed, the above would mean you should somewhere encounter an ilvlRestartLim of 1 (not 3!) associated with the currently applicable lvl (which may come from an override).
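The restart behaviour described in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not code from any of the attached patches: the class name RestartSketch, the ALWAYS_RESTART marker and the assumption that every level starts counting at 1 are made up for the example.

```java
/** Hypothetical illustration of ilvlRestartLim; not part of Tika or POI. */
class RestartSketch {
    /** Marker for the ordinary case: restart after every more-significant level. */
    static final int ALWAYS_RESTART = -1;

    private final int[] counters;   // current number per level (0 = not yet used)
    private final int[] restartLim; // per level: ALWAYS_RESTART or an ilvlRestartLim value

    RestartSketch(int[] restartLim) {
        this.restartLim = restartLim.clone();
        this.counters = new int[restartLim.length];
    }

    /** Registers a list item on level ilvl and returns its number (assumed to start at 1). */
    int next(int ilvl) {
        counters[ilvl]++;
        // Every less-significant level restarts, unless its restart limit says that
        // only levels more significant than the limit may trigger a restart.
        for (int j = ilvl + 1; j < counters.length; j++) {
            if (restartLim[j] == ALWAYS_RESTART || ilvl < restartLim[j]) {
                counters[j] = 0;
            }
        }
        return counters[ilvl];
    }
}
```

With a restart limit of 1 on the third level, the items 1, a, I, II, b are followed by III on the third level, whereas in the ordinary case the item after b would be numbered I again.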
[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moritz Dorka updated TIKA-1315:
---
Attachment: complex_list_test.doc
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528359#comment-14528359 ]

Moritz Dorka commented on TIKA-1315:

[~talli...@mitre.org], I've attached [^complex_list_test.doc] to this bug, which exhibits _some_, but unfortunately not all, possible list numbering traps. A correct algorithm should compute the same value for both the numberText and the paragraph content. However, I wasn't able to squeeze all possible circumstances under which Word creates those nasty {{ListFormatOverrideLevels}} into this test file.

The general idea (and this might help you for .docx, which I do not have access to) is to create a list in Word, select a few entries, open the list's formatting properties, make some changes, go into the advanced formatting features (in Word 2003 that is a button in the lower right part of the dialogue which causes the window to expand) and let those changes apply only to the current selection.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004 ]

Moritz Dorka commented on TIKA-1315:

Well, the original patch by Filip is essentially an 80% solution. Everything that I added is rather obscure functionality...

Basic list support in WordExtractor
---
Key: TIKA-1315 URL: https://issues.apache.org/jira/browse/TIKA-1315 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Filip Bednárik Priority: Minor Fix For: 1.9 Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042 ]

Moritz Dorka commented on TIKA-1315:

I believe I could speed up the process by finally writing a unit test for the POI part... I'm just having a hard time motivating myself to write unit tests for a few stupid getters.

What you could also do is hardcode {code}getLevelNumberingPlaceholderOffsets(){code} to always return {code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most (trivial) cases (however, I have not tested how my code reacts to such cheating).

There is also a very subtle bug left in my code which only triggers in ListLevelOverrides and _sometimes_ provokes wrong number increments. If I find the time I will update my patch.

Basic list support in WordExtractor
---
Key: TIKA-1315 URL: https://issues.apache.org/jira/browse/TIKA-1315 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Filip Bednárik Priority: Minor Fix For: 1.9 Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
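To make the hardcoding suggestion above a bit more concrete, here is a minimal sketch of how such placeholder offsets might be used. It rests on assumptions that are not confirmed in this thread: that the offsets are one-based positions inside the level's number text template, and that the character at each such position holds the zero-based index of the level whose number is to be inserted. PlaceholderSketch and render() are hypothetical names; only the offset values come from the comment.

```java
/** Hypothetical illustration of placeholder offsets; not the patch's actual code. */
class PlaceholderSketch {
    /** The hardcoded one-based offsets suggested in the comment. */
    static final int[] HARDCODED_OFFSETS = {1, 3, 5, 7, 9, 11, 13, 15, 17};

    /**
     * Replaces each placeholder in the template with the already formatted number
     * of the level it names. Assumption: the char at a placeholder position holds
     * the zero-based level index.
     */
    static String render(String template, int[] offsets, String[] levelNumbers) {
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0; // zero-based index of the next template char to copy
        for (int offset : offsets) {
            int pos = offset - 1;
            if (pos >= template.length()) {
                break; // offset points past the template, so no placeholders remain
            }
            int level = template.charAt(pos);
            if (level >= levelNumbers.length) {
                continue; // not a placeholder after all; it is copied as literal text later
            }
            out.append(template, copiedUpTo, pos);
            out.append(levelNumbers[level]);
            copiedUpTo = pos + 1;
        }
        out.append(template, copiedUpTo, template.length());
        return out.toString();
    }
}
```

Under these assumptions the hardcoded offsets only work when every placeholder is followed by exactly one literal character, as in a "1.b.III." style template. A template such as "(1)" has its placeholder at offset 2, which the hardcoded array never visits.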
[jira] [Updated] (TIKA-1468) Symbol character handling in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moritz Dorka updated TIKA-1468:
---
Attachment: WordParserTest.patch
            testWORD_specialcharacters.tar.bz2

Requested jUnit testcase

Symbol character handling in WordExtractor
--
Key: TIKA-1468
URL: https://issues.apache.org/jira/browse/TIKA-1468
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.6
Reporter: Moritz Dorka
Priority: Minor
Attachments: WordExtractor.patch, WordParserTest.patch, testWORD_specialcharacters.tar.bz2

Attached is a patch to allow for proper handling of _symbol characters_ in *.doc files (i.e. stuff which can be inserted via Insert-Symbol in Word).

Side note: I am a little unsure where exactly the boundary between the scope of TIKA and POI lies here. Theoretically one could add that patch to {{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument, CharacterRun, Element)}} as well.
[jira] [Commented] (TIKA-1468) Symbol character handling in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213535#comment-14213535 ]

Moritz Dorka commented on TIKA-1468:

So here is a jUnit testcase which relies on the special handling of characters from the Symbol font. The Microsoft specs talk about a case where these special characters already come in their Unicode representation (thus triggering the fallback in [^WordExtractor.patch]). However, I have no idea how to create a Word file that actually shows this behavior...

Regarding the location of the logic: does TIKA actually make use of POI's {{AbstractWordConverter}}?
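For readers without the patch at hand, the general idea of "map Symbol font characters, fall back if they are already Unicode" might look roughly like the sketch below. It is not taken from [^WordExtractor.patch]. The class and method names are hypothetical, the table is deliberately tiny, and the handling of the private-use range 0xF000-0xF0FF is an assumption about how Word tends to store symbol font characters.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of Symbol font mapping; not the attached patch. */
class SymbolSketch {
    /** A tiny excerpt of the Symbol font encoding (Latin letter code -> Greek letter). */
    private static final Map<Integer, String> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, "\u03B1"); // a -> alpha
        SYMBOL_TO_UNICODE.put(0x62, "\u03B2"); // b -> beta
        SYMBOL_TO_UNICODE.put(0x67, "\u03B3"); // g -> gamma
        SYMBOL_TO_UNICODE.put(0x64, "\u03B4"); // d -> delta
        SYMBOL_TO_UNICODE.put(0x70, "\u03C0"); // p -> pi
    }

    /** Converts one character of a Symbol font run to its Unicode equivalent. */
    static String toUnicode(char c) {
        int code = c;
        // Assumption: symbol characters may be shifted into the private use area.
        if (code >= 0xF000 && code <= 0xF0FF) {
            code -= 0xF000;
        }
        String mapped = SYMBOL_TO_UNICODE.get(code);
        // Fallback: unknown codes (e.g. text that is already Unicode) pass through unchanged.
        return mapped != null ? mapped : String.valueOf(c);
    }
}
```

A character that already arrives as a Greek letter is not found in the table and is returned as-is, which corresponds to the fallback case mentioned in the comment.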
[jira] [Created] (TIKA-1468) Symbol character handling in WordExtractor
Moritz Dorka created TIKA-1468:
--
Summary: Symbol character handling in WordExtractor
Key: TIKA-1468
URL: https://issues.apache.org/jira/browse/TIKA-1468
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.6
Reporter: Moritz Dorka
Priority: Minor

Attached is a patch to allow for proper handling of _symbol characters_ in *.doc files (i.e. stuff which can be inserted via Insert-Symbol in Word).

Side note: I am a little unsure where exactly the boundary between the scope of TIKA and POI lies here. Theoretically one could add that patch to {{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument, CharacterRun, Element)}} as well.
[jira] [Updated] (TIKA-1468) Symbol character handling in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moritz Dorka updated TIKA-1468:
---
Attachment: WordExtractor.patch
[jira] [Comment Edited] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495 ]

Moritz Dorka edited comment on TIKA-1315 at 9/22/14 8:14 AM:

Hmm, apparently files are global to a bug in Jira and are not linked to specific comments... Too bad. So this is related to [^ListManager.tar.bz2] and [^ListNumbering.patch], which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good but
* it lacks true support for ListFormatOverrideLevels (which, admittedly, is a really brain-twisting feature of Word)
* it does not cope correctly with bullets / unnumbered items (i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists
* there is no support for legal formatting and
* no support for levels which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm written from scratch, with the exception of two helper methods ({{intToRoman()}} + {{intToLetter()}}) which are still based on the original blog post cited by Filip. I consider them rather trivial, so it is hopefully not a problem to include them in tika. The code is an attempt to fully implement the algorithm outlined in MS-DOC, v20140721, [2.4.6.3|http://msdn.microsoft.com/en-us/library/dd921056%28v=office.12%29.aspx] + [2.4.6.4|http://msdn.microsoft.com/en-us/library/dd945275%28v=office.12%29.aspx].

The downside of my approach is that it IMHO externalizes quite a bit of functionality which should actually be inside POI. Since those ListLevelOverrides can also influence the overall formatting of the paragraph (something which is handled by POI), this can lead to inconsistent behaviour.

The current testcase ({{WordParserTest.java}}) has rather poor coverage of the proposed new algorithm. I have a better test file here which reaches about 80% (the rest being mostly error handling stuff). Give me a shout if you want that to be included in tika as well.

Make sure to apply [this patch|https://issues.apache.org/bugzilla/show_bug.cgi?id=56998] to POI before using this.

was (Author: morido):
Hmm, apparently files are global to a bug in Jira and are not linked to specific comments... Too bad. So this is related to ListManager.tar.bz2 and ListNumbering.patch which I propose as substitutes for Filip's work. The original patch proposed by Filip is quite good but it lacks true support for ListFormatOverrideLevels (which, allegedly, is a really brain-twisting feature of Word), it does not cope correctly with bullets / unnumbered items (i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists and there is no support for either legal formatting or levels which restart at arbitrary more-significant levels. Attached is an improved version of the numbering algorithm written from scratch, with the exception of two helper methods (intToRoman() + intToLetter()) which are still based on the original blog post cited by Filip. I consider them rather trivial, so it is hopefully not a problem to include them in tika. The code is an attempt to fully implement the algorithm outlined in [MS-DOC], v20140721, 2.4.6.3 + 2.4.6.4. Downside of my approach is that it IMHO externalizes quite a bit of functionality which should actually be inside POI. Since those ListLevelOverrides can also influence the overall formatting of the paragraph (something which is handled by POI) this can lead to inconsistent behaviour. The current testcase (WordParserTest.java) has rather poor coverage of the proposed new algorithm. I have a better test file here which reaches about 80% (the rest being mostly error handling stuff). Give me a shout if you want that to be included in tika as well. Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to POI before using this.
Basic list support in WordExtractor
---
Key: TIKA-1315 URL: https://issues.apache.org/jira/browse/TIKA-1315 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Filip Bednárik Priority: Minor Fix For: 1.7 Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
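As an aside on the two "rather trivial" helpers mentioned in the comment above, the sketch below shows what methods named intToRoman() and intToLetter() could plausibly do. It is a fresh illustration and not the code from the blog post or from [^ListManager.tar.bz2]. The class name, the hasNumber() check and the choice to repeat the letter beyond 26 (aa, bb, ...) are assumptions; whether the attached code behaves the same way is not known from this thread.

```java
/** Hypothetical illustration of the helper methods; not the attached code. */
class ListNumberFormatSketch {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] NUMERALS =
            {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};

    /** True unless the nfc marks a bullet / unnumbered level (0x17 or 0xFF, per the comment). */
    static boolean hasNumber(int nfc) {
        return nfc != 0x17 && nfc != 0xFF;
    }

    /** Formats a positive number as an upper-case Roman numeral. */
    static String intToRoman(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive: " + n);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {
                sb.append(NUMERALS[i]);
                n -= VALUES[i];
            }
        }
        return sb.toString();
    }

    /** Formats a positive number as a, b, ..., z, aa, bb, ... (assumed repetition scheme). */
    static String intToLetter(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive: " + n);
        }
        char letter = (char) ('a' + (n - 1) % 26);
        int repeats = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }
}
```

For example, intToRoman(3) yields "III" and intToLetter(2) yields "b", the kinds of values that make up a label such as 1.b.III.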
[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moritz Dorka updated TIKA-1315:
---
Attachment: ListNumbering.patch
            ListManager.tar.bz2

File paths are relative to the tika-parsers subproject

Basic list support in WordExtractor
---
Key: TIKA-1315 URL: https://issues.apache.org/jira/browse/TIKA-1315 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Filip Bednárik Priority: Minor Fix For: 1.7 Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495 ]

Moritz Dorka commented on TIKA-1315:

Hmm, apparently files are global to a bug in Jira and are not linked to specific comments... Too bad. So this is related to ListManager.tar.bz2 and ListNumbering.patch, which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good but it lacks true support for ListFormatOverrideLevels (which, allegedly, is a really brain-twisting feature of Word), it does not cope correctly with bullets / unnumbered items (i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists and there is no support for either legal formatting or levels which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm written from scratch, with the exception of two helper methods (intToRoman() + intToLetter()) which are still based on the original blog post cited by Filip. I consider them rather trivial, so it is hopefully not a problem to include them in tika. The code is an attempt to fully implement the algorithm outlined in [MS-DOC], v20140721, 2.4.6.3 + 2.4.6.4.

The downside of my approach is that it IMHO externalizes quite a bit of functionality which should actually be inside POI. Since those ListLevelOverrides can also influence the overall formatting of the paragraph (something which is handled by POI), this can lead to inconsistent behaviour.

The current testcase (WordParserTest.java) has rather poor coverage of the proposed new algorithm. I have a better test file here which reaches about 80% (the rest being mostly error handling stuff). Give me a shout if you want that to be included in tika as well.

Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to POI before using this.

Basic list support in WordExtractor
---
Key: TIKA-1315 URL: https://issues.apache.org/jira/browse/TIKA-1315 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Filip Bednárik Priority: Minor Fix For: 1.7 Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch