[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572018#comment-14572018 ]

Hudson commented on TIKA-1315:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #727 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/727/])
TIKA-1315 cleanup after run against govdocs1 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683450)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java

> Basic list support in WordExtractor
> -----------------------------------
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Filip Bednárik
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.10
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch,
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch,
> complex_list_test.doc
>
>
> Hello guys, I am really sorry to post issue like this because I have no other
> way of contacting you and I don't quite understand how you manage forks and
> pull requests (I don't think you do that). Plus I don't know your coding
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc
> documents, but TIKA doesn't support it. So I looked for solution and found
> one here:
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
> . So I adapted this solution to Apache TIKA with few fixes and improvements.
> Anyway feel free to use any of it so it can help people who struggle with
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571972#comment-14571972 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

Fixed issues identified during pre-release run against govdocs1 mentioned [here|https://mail-archives.apache.org/mod_mbox/tika-dev/201506.mbox/ajax/%3CDM2PR09MB07138B8183F73F3F800779D5C7B50%40DM2PR09MB0713.namprd09.prod.outlook.com%3E]. Confirmed fixes on govdocs1.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563501#comment-14563501 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

I added the relevant part of [~morido]'s test document to POI via POI-57889 so that we'll actually be able to handle overrides in docx when the next version of POI is released.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563419#comment-14563419 ]

Hudson commented on TIKA-1315:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #715 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/715/])
TIKA-1315 -- basic list support for WordExtractor; still need to add in override behavior once we add a class to ooxml via POI (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1682287)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_numbered_list.doc
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_numbered_list.docx
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_override_list_numbering.doc
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_override_list_numbering.docx
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563302#comment-14563302 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

With many thanks to [~drndos] and [~morido], I've added _basic_ list support for doc and docx files in r1682287.

[~morido], your patch and test doc were crucial. We have to add an ooxml class via POI before we can make overrides work for docx. I've coded that chunk and commented it out, and there's a test case that is also commented out (again, thanks to [~morido]'s test doc). I won't close this issue until that is updated.

I have no doubt that further work remains, but this should be a good start.

[~gullbyrd], many apologies for the amount of time this took. Please build from trunk and run against your docs to see if this meets your needs.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562697#comment-14562697 ]

Moritz Dorka commented on TIKA-1315:
------------------------------------

The case of "none" for the numberText of the current level (i.e. its nfc equals 0xFF or 0x17) isn't really well specified by Microsoft in their binary *.doc specification. Hence, this 5th test. Don't know if that has changed with the new Ecma/ISO XML specs.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561619#comment-14561619 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

Alright, good to go on both doc and docx for tests 1-4. Do you happen to remember how you built the list in test 5?
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561436#comment-14561436 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

Great. Thank you. Between your description and [this|http://officeopenxml.com/WPnumbering-restart.php], we're good to go on .docx for your tests 1-4. I'm holding off on your fifth test for now. Turning now to .doc.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560750#comment-14560750 ]

Moritz Dorka commented on TIKA-1315:
------------------------------------

bq. For test 2, how did you get 1.b.III?

It's been quite a while since I authored that file. But at first glance I suppose this happens because the restartLim is set not to the first (the ordinary case) but to the second most-significant ilvl. This means the third level will only see a reset each time an item belonging to the first level occurs. Since "1.b", which precedes the element in question, belongs to the second level, no such reset happens and the "II" from "1.a.II" gets incremented instead.

See the definition of [ilvlRestartLim|https://msdn.microsoft.com/en-us/library/dd923594%28v=office.12%29.aspx] for a more complicated explanation.

I do not know Word's XML-stuff, but provided the logic hasn't changed, the above would mean you should somewhere encounter an ilvlRestartLim of 1 (not 3!) associated with the currently applicable lvl (which may come from an override).
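The restart behaviour described for ilvlRestartLim can be sketched in a few lines of Java. This is only a simplified model of the reading given in the comment (a level's counter is reset when an item occurs at a level more significant than that level's restart limit); it is not the Tika or POI code, and the class name `RestartSketch` and its methods are made up for illustration. The binary format specification remains the authority on the exact semantics.

```java
/** Hypothetical illustration of per-level counters with a restart limit; not Tika/POI code. */
final class RestartSketch {
    private final int[] counters;   // current count per zero-based level
    private final int[] restartLim; // assumed meaning: level i resets only after items at levels < restartLim[i]

    RestartSketch(int[] restartLim) {
        this.restartLim = restartLim.clone();
        this.counters = new int[restartLim.length];
    }

    /** Registers a list paragraph at the given zero-based level and returns that level's new count. */
    int next(int level) {
        // Reset each deeper level whose restart limit is still below the level just seen.
        for (int deeper = level + 1; deeper < counters.length; deeper++) {
            if (level < restartLim[deeper]) {
                counters[deeper] = 0;
            }
        }
        return ++counters[level];
    }

    public static void main(String[] args) {
        // Third level restarts only after a first-level item (restart limit 1 instead of the ordinary 2).
        RestartSketch list = new RestartSketch(new int[] {0, 1, 1});
        list.next(0); // "1"
        list.next(1); // "1.a"
        list.next(2); // "1.a.I"
        list.next(2); // "1.a.II"
        list.next(1); // "1.b", a second-level item, so the third level is not reset
        System.out.println(list.next(2)); // third-level count continues from II
    }
}
```

With the ordinary limits `{0, 1, 2}` the same sequence would restart the third level after "1.b", which is the difference the test document exercises.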
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560311#comment-14560311 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

Breaks current code quite nicely. Thank you. :)

For test 2, how did you get 1.b.III? I can't find any trace of restart=3 in the xml when I save the file as .docx. Or, is that a continue count from previous?
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540582#comment-14540582 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

Thank you! Will try your steps to get override. When I tried something like that before, Word created a new list.

This test doc is quite helpful. I'll see how well it breaks my doc/docx code. :)
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528359#comment-14528359 ]

Moritz Dorka commented on TIKA-1315:
------------------------------------

[~talli...@mitre.org], I've attached [^complex_list_test.doc] to this bug which exhibits _some_, but unfortunately not all, possible list numbering traps. A correct algorithm should compute the same value for both the numberText and the paragraph content.

However, I wasn't able to squeeze all possible circumstances under which Word creates those nasty {{ListFormatOverrideLevels}} into this test file. The general idea (and this might help you for .docx, which I do not have access to) is to create a list in Word, select a few entries, open the list's formatting properties, make some changes, go into the advanced formatting features (in Word 2003 that is a button in the lower right part of the dialogue which causes the window to expand) and let those changes only apply to the current selection.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527764#comment-14527764 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

[~morido], any chance you could attach a file (or one for .doc and one for .docx) that we could use to test {{ListFormatOverrideLevels}}? I've figured out how to manually reset the numbers and my current code handles that; but I can't figure out how to get Word to have numbering override abstract numbering info (in docx terms, that is).
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527751#comment-14527751 ]

Tim Allison commented on TIKA-1315:
-----------------------------------

I've been trying to create a ListManager class that can be used for both doc and docx. I found that we need to add a few classes at the POI level to get the number format string for docx (e.g. "%1.") into the ooxml-lite jar. In [POI-57889|https://bz.apache.org/bugzilla/show_bug.cgi?id=57889], I added code to XWPFParagraph to handle that and the override starts.

I initially thought that the number format string wasn't that important; but it really is, especially if the numbering is along the lines of:
{noformat}
1
1.1
1.1.1
{noformat}

So, we'll have to wait for the release of the next version of POI before we can close out this ticket. That said, I can and will continue to prep the code at the Tika level so that we're ready to go. The next version of POI is due out in the next week or so.
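To illustrate why the number format string matters for numbering such as 1, 1.1, 1.1.1, here is a minimal Java sketch that expands a docx-style format string such as "%1." against per-level counters. It assumes that `%N` refers to the one-based level N and it renders decimal numbers only; the class `LevelTextSketch` and its `expand` method are hypothetical and are not part of POI or Tika.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of expanding a number format string; not POI/Tika code. */
final class LevelTextSketch {
    // Assumption: placeholders look like %1 .. %9 and refer to one-based levels.
    private static final Pattern PLACEHOLDER = Pattern.compile("%([1-9])");

    /** Replaces each placeholder with the current counter of the level it names (decimal only). */
    static String expand(String levelText, int[] counters) {
        Matcher m = PLACEHOLDER.matcher(levelText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(levelText, last, m.start()); // literal text between placeholders
            int level = Integer.parseInt(m.group(1)) - 1;
            if (level < counters.length) {
                out.append(counters[level]);
            } else {
                out.append(m.group()); // unknown level: leave the placeholder untouched
            }
            last = m.end();
        }
        out.append(levelText, last, levelText.length());
        return out.toString();
    }

    public static void main(String[] args) {
        int[] counters = {1, 1, 1};
        System.out.println(expand("%1", counters));
        System.out.println(expand("%1.%2", counters));
        System.out.println(expand("%1.%2.%3", counters));
    }
}
```

Without the format string, a third-level paragraph only carries its own counter; the string is what says that the first- and second-level counters belong in front of it.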
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042 ]

Moritz Dorka commented on TIKA-1315:

I believe I could speed up the process by finally writing a unit test for the POI part... I'm just having a hard time motivating myself to write unit tests for a few stupid getters.

What you could also do is hardcode {code}getLevelNumberingPlaceholderOffsets(){code} to always return {code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most (trivial) cases (however, I have not tested how my code reacts to such cheating).

There is also a very subtle bug left in my code which only triggers in ListLevelOverrides and _sometimes_ causes wrong number increments. If I find the time I will update my patch.
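A rough sketch of when the hardcoded offsets would and would not hold. This is based on one reading of the level text layout in [MS-DOC], which is an assumption here: the number text holds characters with values 0 to 8 as placeholders for the counters of levels 1 to 9, and the offsets are the 1-based positions of those placeholders. The class and method names are hypothetical, and the real semantics of `getLevelNumberingPlaceholderOffsets()` in the patch are not shown in this thread.

```java
import java.util.Arrays;

class PlaceholderOffsetSketch {

    /** The hardcoded offsets suggested above (1-based positions). */
    static final int[] HARDCODED = {1, 3, 5, 7, 9, 11, 13, 15, 17};

    /**
     * Returns the 1-based positions of level placeholders in the number text.
     * Assumption: a char with value 0..8 is a placeholder for that level.
     */
    static int[] placeholderOffsets(String numberText) {
        int[] tmp = new int[numberText.length()];
        int n = 0;
        for (int i = 0; i < numberText.length(); i++) {
            if (numberText.charAt(i) <= 8) {
                tmp[n++] = i + 1;
            }
        }
        return Arrays.copyOf(tmp, n);
    }

    /** True if the real offsets are a prefix of the hardcoded ones. */
    static boolean matchesHardcoded(String numberText) {
        int[] actual = placeholderOffsets(numberText);
        if (actual.length > HARDCODED.length) {
            return false;
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != HARDCODED[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "<lvl1>.<lvl2>.<lvl3>." has placeholders at positions 1, 3 and 5.
        System.out.println(matchesHardcoded("\u0000.\u0001.\u0002.")); // true
        // A text prefix shifts the placeholder to position 9.
        System.out.println(matchesHardcoded("Section \u0000"));        // false
    }
}
```

Under these assumptions the shortcut works whenever each placeholder is followed by exactly one separator character and nothing precedes the first one, which matches the "trivial cases" caveat.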
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004 ]

Moritz Dorka commented on TIKA-1315:

Well, the original patch by Filip is essentially an 80% solution. Everything that I added is rather obscure functionality...
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505008#comment-14505008 ]

Tim Allison commented on TIKA-1315:

Ha. Ok, but your patch is really well done. Let me take a look at Filip's. I'll see if we can find someone on POI to add that call soon. Thank you!
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498489#comment-14498489 ]

Tim Allison commented on TIKA-1315:

[~morido], thank you for this patch. Is there any way to "cheat" until the patch can be made to POI so that we can get an 80% solution now? {{getLevelNumberingPlaceholderOffsets()}} looks important to me at first glance...
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495 ]

Moritz Dorka commented on TIKA-1315:

Hmm, apparently files are global to a bug in Jira and are not linked to specific comments... Too bad. So this is related to ListManager.tar.bz2 and ListNumbering.patch, which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good, but it lacks true support for ListFormatOverrideLevels (which, admittedly, is a really brain-twisting feature of Word), it does not cope correctly with bullets / unnumbered items (i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists, and there is no support for either legal formatting or levels which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm, written from scratch with the exception of two helper methods (intToRoman() + intToLetter()), which are still based on the original blog post cited by Filip. I consider them rather trivial, so it is hopefully not a problem to include them in Tika. The code is an attempt to fully implement the algorithm outlined in [MS-DOC], v20140721, 2.4.6.3 + 2.4.6.4.

The downside of my approach is that it IMHO externalizes quite a bit of functionality which should actually be inside POI. Since those ListLevelOverrides can also influence the overall formatting of the paragraph (something which is handled by POI), this can lead to inconsistent behaviour.

The current test case (WordParserTest.java) has rather bad coverage of the proposed new algorithm. I have a better test file here which reaches about 80% (the rest being mostly error handling stuff). Give me a shout if you want that to be included in Tika as well.

Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to POI before using this.
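For readers unfamiliar with the two helper methods mentioned above, here is a minimal sketch of what such helpers typically look like. Only the names intToRoman() and intToLetter() come from the comment; the bodies below are written independently and are not the code from the patch or the blog post. The letter scheme (a..z, then aa, bb, ...) is an assumption about how Word repeats letters after "z".

```java
class NumberStyleSketch {

    /** Converts 1..3999 to upper-case Roman numerals; other values fall back to digits. */
    static String intToRoman(int n) {
        if (n <= 0 || n >= 4000) {
            return Integer.toString(n);
        }
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            // Greedily subtract the largest value that still fits.
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /**
     * Converts 1, 2, ... to a, b, ..., z, aa, bb, ...
     * Assumption: after "z" the same letter is repeated, not "ab"-style counting.
     */
    static String intToLetter(int n) {
        if (n <= 0) {
            return Integer.toString(n);
        }
        int repeat = (n - 1) / 26 + 1;
        char letter = (char) ('a' + (n - 1) % 26);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeat; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(intToRoman(4));    // IV
        System.out.println(intToRoman(1994)); // MCMXCIV
        System.out.println(intToLetter(3));   // c
        System.out.println(intToLetter(28));  // bb
    }
}
```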
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013974#comment-14013974 ]

Nick Burch commented on TIKA-1315:

In ListUtils, it might be safest to make UNORDERED_LIST_CHAR use the unicode escape sequence for that character, and maybe initialise the map up front.

ListUtils seems to be largely based on someone else's work - do you have their OK to contribute it to Tika? (We can only accept code that is either already under an appropriate license, or willingly contributed by the author.)

The unit test looks a little slim - any chance of something that checks in a bit more detail? E.g. explicit entry checks, html level checks etc.?

Can we do the same for XWPF / .docx files using the same / similar logic?
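A small sketch of the two ListUtils suggestions, under stated assumptions. The attached ListUtils.java is not shown in this thread, so everything except the name UNORDERED_LIST_CHAR is hypothetical: the bullet character (U+2022), the class name, and the contents of the map. The nfc codes used as map keys are taken from one reading of [MS-DOC] and should be checked against the spec.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class ListUtilsSketch {

    // Hypothetical value: U+2022 BULLET written as an escape, so the
    // encoding of the source file cannot corrupt the character.
    static final String UNORDERED_LIST_CHAR = "\u2022";

    // Hypothetical map, built once in a static initialiser instead of
    // lazily on first use, and exposed read-only.
    static final Map<Integer, String> NUMBER_STYLES;

    static {
        Map<Integer, String> m = new HashMap<Integer, String>();
        m.put(0x00, "decimal");
        m.put(0x01, "upper-roman");
        m.put(0x02, "lower-roman");
        m.put(0x03, "upper-letter");
        m.put(0x04, "lower-letter");
        m.put(0x17, "bullet");
        m.put(0xFF, "none");
        NUMBER_STYLES = Collections.unmodifiableMap(m);
    }

    /** True for number format codes that produce no counter text. */
    static boolean isUnnumbered(int nfc) {
        return nfc == 0x17 || nfc == 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(UNORDERED_LIST_CHAR.codePointAt(0))); // 2022
        System.out.println(NUMBER_STYLES.get(0x17)); // bullet
    }
}
```

Initialising the map in a static block also avoids any thread-safety questions that lazy initialisation could raise when a parser is shared between threads.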
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013919#comment-14013919 ]

Filip Bednárik commented on TIKA-1315:

Sure, I updated the issue.
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013873#comment-14013873 ]

Nick Burch commented on TIKA-1315:

Any chance you could post a patch of your changes, rather than complete changed files?