[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-06-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572018#comment-14572018
 ] 

Hudson commented on TIKA-1315:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #727 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/727/])
TIKA-1315 cleanup after run against govdocs1 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683450)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java


> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.10
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
> complex_list_test.doc
>
>
> Hello guys, I am really sorry to post issue like this because I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus I don't know your coding 
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc 
> documents, but TIKA doesn't support it. So I looked for solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . So I adapted this solution to Apache TIKA with few fixes and improvements. 
> Anyway feel free to use any of it so it can help people who struggle with 
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-06-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571972#comment-14571972
 ] 

Tim Allison commented on TIKA-1315:
---

Fixed issues identified during pre-release run against govdocs1 mentioned 
[here| 
https://mail-archives.apache.org/mod_mbox/tika-dev/201506.mbox/ajax/%3CDM2PR09MB07138B8183F73F3F800779D5C7B50%40DM2PR09MB0713.namprd09.prod.outlook.com%3E].

Confirmed fixes on govdocs1.



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563501#comment-14563501
 ] 

Tim Allison commented on TIKA-1315:
---

I added the relevant part of [~morido]'s test document to POI via POI-57889 so 
that we'll actually be able to handle overrides in docx when the next version 
of POI is released.

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
> complex_list_test.doc


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563419#comment-14563419
 ] 

Hudson commented on TIKA-1315:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #715 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/715/])
TIKA-1315 -- basic list support for WordExtractor; still need to add in 
override behavior once we add a class to ooxml via POI (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1682287)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_numbered_list.doc
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_numbered_list.docx
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_override_list_numbering.doc
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_override_list_numbering.docx




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563302#comment-14563302
 ] 

Tim Allison commented on TIKA-1315:
---

With many thanks to [~drndos] and [~morido], I've added _basic_ list support 
for doc and docx files in r1682287.  [~morido], your patch and test doc were 
crucial. 

We have to add an ooxml class via POI before we can make overrides work for 
docx.  I've written that chunk of the code but left it commented out, and 
there's a test case that is also commented out (again, thanks to [~morido]'s 
test doc).  I won't close this issue until that is updated.

I have no doubt that further work remains, but this should be a good start.

[~gullbyrd], many apologies for the amount of time this took.  Please build 
from trunk and run against your docs to see if this meets your needs.




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562697#comment-14562697
 ] 

Moritz Dorka commented on TIKA-1315:


The case of "none" for the numberText of the current level (i.e. its nfc equals 
0xFF or 0x17) isn't really well specified by Microsoft in their binary *.doc 
specification. Hence, this 5th test. Don't know if that has changed with the 
new Ecma/ISO XML specs.



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561619#comment-14561619
 ] 

Tim Allison commented on TIKA-1315:
---

Alright, good to go on both doc and docx for tests 1-4.  Do you happen to 
remember how you built the list in test 5?



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561436#comment-14561436
 ] 

Tim Allison commented on TIKA-1315:
---

Great.  Thank you.  Between your description and 
[this|http://officeopenxml.com/WPnumbering-restart.php], we're good to go on 
.docx for your tests 1-4.  I'm holding off on your fifth test for now.  Turning 
now to .doc.



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-27 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560750#comment-14560750
 ] 

Moritz Dorka commented on TIKA-1315:


bq. For test 2, how did you get 1.b.III?

It's been quite a while since I authored that file. But at first glance I 
suppose this happens because the restartLim is set not to the first 
most-significant ilvl (the ordinary case) but to the second. This means the 
third level will only see a reset each time an item belonging to the first 
level occurs. Since "1.b", which precedes the element in question, belongs to 
the second level, no such reset happens and the "II" from "1.a.II" gets 
incremented instead.
See the definition of 
[ilvlRestartLim|https://msdn.microsoft.com/en-us/library/dd923594%28v=office.12%29.aspx]
 for a more complicated explanation.

I do not know Word's XML-stuff, but provided the logic hasn't changed, the above 
would mean you should somewhere encounter an ilvlRestartLim of 1 (not 3!) 
associated with the currently applicable lvl (which may come from an override).
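
A minimal sketch of the reset rule described above, not the Tika or POI code. The class name, the method names and the per-level restartLimit array are hypothetical, and the model is an assumption based on this description: a deeper level is reset only when an item occurs at a level more significant than that deeper level's restart limit. Labels are rendered as plain decimals, so "1.b.III" shows up as "1.2.3".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; nothing here is part of Tika or POI.
final class RestartLimitSketch {
    private final int[] counters;
    // Assumed model: restartLimit[d] is the zero-based level at or below which
    // items no longer reset level d. The ordinary case is restartLimit[d] == d.
    private final int[] restartLimit;

    RestartLimitSketch(int[] restartLimit) {
        this.restartLimit = restartLimit.clone();
        this.counters = new int[restartLimit.length];
    }

    /** Advances the counters for one item at the given zero-based level. */
    String next(int level) {
        counters[level]++;
        for (int deeper = level + 1; deeper < counters.length; deeper++) {
            // Reset a deeper level only if this item is more significant
            // than that level's restart limit.
            if (level < restartLimit[deeper]) {
                counters[deeper] = 0;
            }
        }
        StringBuilder label = new StringBuilder();
        for (int i = 0; i <= level; i++) {
            if (i > 0) {
                label.append('.');
            }
            label.append(counters[i]);
        }
        return label.toString();
    }

    /** Numbers a whole sequence of items given by their zero-based levels. */
    static List<String> number(int[] restartLimit, int... levels) {
        RestartLimitSketch sketch = new RestartLimitSketch(restartLimit);
        List<String> labels = new ArrayList<>();
        for (int level : levels) {
            labels.add(sketch.next(level));
        }
        return labels;
    }

    public static void main(String[] args) {
        // Ordinary case: the third level restarts after every second-level item.
        System.out.println(number(new int[] {0, 1, 2}, 0, 1, 2, 2, 1, 2));
        // Restart limit of 1 on the third level: only first-level items reset it.
        System.out.println(number(new int[] {0, 1, 1}, 0, 1, 2, 2, 1, 2));
    }
}
```

With the ordinary limits the last item comes out as 1.2.1; with a limit of 1 on the third level it comes out as 1.2.3, which is the decimal counterpart of the 1.b.III seen in test 2.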



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560311#comment-14560311
 ] 

Tim Allison commented on TIKA-1315:
---

Breaks current code quite nicely.  Thank you. :)

For test 2, how did you get 1.b.III?  I can't find any trace of restart=3 in 
the xml when I save the file as .docx.  Or, is that a continue count from 
previous?



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540582#comment-14540582
 ] 

Tim Allison commented on TIKA-1315:
---

Thank you!  Will try your steps to get override.  When I tried something like 
that before, Word created a new list.  This test doc is quite helpful.  I'll 
see how well it breaks my doc/docx code.  :)



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-05 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528359#comment-14528359
 ] 

Moritz Dorka commented on TIKA-1315:


[~talli...@mitre.org], I've attached [^complex_list_test.doc] to this bug which 
exhibits _some_, but unfortunately not all, possible list numbering traps. A 
correct algorithm should compute the same value for both the numberText and the 
paragraph content.
However, I wasn't able to squeeze all possible circumstances under which Word 
creates those nasty {{ListFormatOverrideLevels}} into this test file. The 
general idea (and this might help you for .docx, which I do not have access to) 
is to create a list in Word, select a few entries, open the list's formatting 
properties, make some changes, go into the advanced formatting features (in 
Word 2003 that is a button in the lower right part of the dialogue which causes 
the window to expand) and let those changes only apply to the current selection.



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527764#comment-14527764
 ] 

Tim Allison commented on TIKA-1315:
---

[~morido], any chance you could attach a file (or one for .doc and one for 
.docx) that we could use to test {{ListFormatOverrideLevels}}?  I've figured 
out how to manually reset the numbers and my current code handles that; but I 
can't figure out how to get Word to have numbering override abstract numbering 
info (in docx terms, that is).

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527751#comment-14527751
 ] 

Tim Allison commented on TIKA-1315:
---

I've been trying to create a ListManager class that can be used for both doc 
and docx.  I found that we need to add a few classes at the POI level to get 
the number format string for docx (e.g. "%1.") into the ooxml-lite jar.  In 
[POI-57889 | https://bz.apache.org/bugzilla/show_bug.cgi?id=57889], I added 
code to XWPFParagraph to handle that and the override starts.   

I initially thought that the number format string isn't that important; but it 
really is, especially if the numbering is along the lines:
{noformat}
1
1.1
1.1.1
{noformat}
So, we'll have to wait for the release of the next version of POI before we can 
close out this ticket.  That said, I can and will continue to prep the code at 
the Tika level so that we're ready to go.  The next version of POI is due out 
in the next week or so.
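
To illustrate why the number format string matters, here is a rough sketch of expanding a docx-style format string such as "%1." against the current per-level counters. The class and method names are hypothetical and are not the POI or Tika API; the sketch renders every counter as a plain decimal and ignores the per-level number format (roman, letters and so on).

```java
// Hypothetical illustration class; not part of POI or Tika.
final class LevelTextSketch {

    /**
     * Expands placeholders %1..%9 in levelText using counters, where
     * counters[0] belongs to level 1. Placeholders that point past the end
     * of the counters array are dropped.
     */
    static String render(String levelText, int[] counters) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < levelText.length(); i++) {
            char c = levelText.charAt(i);
            boolean placeholder = c == '%'
                    && i + 1 < levelText.length()
                    && levelText.charAt(i + 1) >= '1'
                    && levelText.charAt(i + 1) <= '9';
            if (placeholder) {
                int level = levelText.charAt(i + 1) - '1';
                if (level < counters.length) {
                    out.append(counters[level]);
                }
                i++; // skip the digit that followed '%'
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] counters = {1, 1, 1};
        // Same counters, different format strings, different visible labels.
        System.out.println(render("%1.%2.%3", counters));
        System.out.println(render("%3.", counters));
    }
}
```

The same counters give "1.1.1" with "%1.%2.%3" but only "1." with "%3.", so the visible label cannot be rebuilt from the counters alone.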


> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042
 ] 

Moritz Dorka commented on TIKA-1315:


I believe I could speed up the process by ultimately writing a unit test for 
the POI-part... I'm just having a hard time motivating myself to write unit 
tests for a few stupid getters.

What you could also do is hardcode 
{code}getLevelNumberingPlaceholderOffsets(){code} to always return 
{code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most 
(trivial) cases (however, I have not tested how my code reacts to such 
cheating).

There is also a very subtle bug left in my code which only triggers in 
ListLevelOverrides and _sometimes_ provokes wrong number increments. If I find 
the time I will update my patch.
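
A rough sketch of the "cheat" described above, under loud assumptions: the 
class below is a hypothetical stand-in (the real accessor would live in POI), 
and it assumes that the character stored at each placeholder offset holds the 
zero-based level index. Only the method name and the fixed offsets come from 
the comment.

```java
/** Hypothetical stand-in for illustration; not the POI implementation. */
final class PlaceholderOffsetsSketch {
    /** Fixed 1-based offsets, hardcoded as suggested in the comment above. */
    static int[] getLevelNumberingPlaceholderOffsets() {
        return new int[] {1, 3, 5, 7, 9, 11, 13, 15, 17};
    }

    /**
     * Substitutes counters at the given 1-based offsets of the level text.
     * Assumption: the char at each offset is the zero-based level index.
     */
    static String apply(String levelText, int[] offsets, int[] counters) {
        StringBuilder out = new StringBuilder();
        int next = 0; // index of the next unused offset
        for (int i = 0; i < levelText.length(); i++) {
            if (next < offsets.length && offsets[next] == i + 1) {
                int level = levelText.charAt(i);
                if (level < counters.length) {
                    out.append(counters[level]);
                }
                next++;
            } else {
                out.append(levelText.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Level text shaped like "<lvl0>.<lvl1>.<lvl2>."
        String levelText = "" + (char) 0 + '.' + (char) 1 + '.' + (char) 2 + '.';
        int[] counters = {2, 4, 1};
        System.out.println(apply(levelText, getLevelNumberingPlaceholderOffsets(), counters));
    }
}
```

The fixed offsets only line up when every placeholder is a single character 
followed by a single separator. A level text that starts with a literal, such 
as an opening parenthesis, would not match them, which is presumably why this 
only covers the trivial cases.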

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004
 ] 

Moritz Dorka commented on TIKA-1315:


Well, the original patch by Filip is essentially an 80% solution. Everything 
that I added is rather obscure functionality...

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505008#comment-14505008
 ] 

Tim Allison commented on TIKA-1315:
---

Ha.  Ok, but your patch is really well done.  Let me take a look at Filip's.  
I'll see if we can find someone on POI to add that call soon.  Thank you!

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498489#comment-14498489
 ] 

Tim Allison commented on TIKA-1315:
---

[~morido], thank you for this patch.  Is there any way to "cheat" until the 
patch can be made to POI so that we can get an 80% solution now?  
{{getLevelNumberingPlaceholderOffsets()}} looks important to me on first 
glance...

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2014-09-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495
 ] 

Moritz Dorka commented on TIKA-1315:


Hmm, apparently files are global to a bug in Jira and are not linked to 
specific comments... Too bad. So this is related to ListManager.tar.bz2 and 
ListNumbering.patch which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good, but it lacks true support 
for ListFormatOverrideLevels (which, allegedly, is a really brain-twisting 
feature of Word), it does not cope correctly with bullets / unnumbered items 
(i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of 
multilevel lists, and there is no support for either legal formatting or levels 
which restart at arbitrary more-significant levels.
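
To make the bullet / unnumbered case concrete, here is a deliberately 
simplified counter model. It is not the algorithm from the attached patch: 
the class is hypothetical, it ignores overrides and restart limits entirely, 
and it simply resets all deeper levels whenever an item appears. Only the two 
nfc values come from the paragraph above.

```java
/** Simplified, hypothetical counter model; not the algorithm from the patch. */
final class ListCounterSketch {
    // nfc values named above for bullets / unnumbered items.
    static final int NFC_BULLET = 0x17;
    static final int NFC_NONE = 0xFF;

    // One counter per list level; Word lists have up to nine levels.
    private final int[] counters = new int[9];

    static boolean isNumbered(int nfc) {
        return nfc != NFC_BULLET && nfc != NFC_NONE;
    }

    /**
     * Registers an item at the given zero-based level. Numbered items advance
     * the counter of that level; bullets and unnumbered items leave it alone.
     * Deeper levels are reset in both cases. Returns the level's counter.
     */
    int next(int level, int nfc) {
        if (isNumbered(nfc)) {
            counters[level]++;
        }
        for (int i = level + 1; i < counters.length; i++) {
            counters[i] = 0;
        }
        return counters[level];
    }
}
```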

Attached is an improved version of the numbering algorithm written from 
scratch, with the exception of two helper methods (intToRoman() + 
intToLetter()) which are still based on the original blog post cited by Filip. 
I consider them rather trivial, so it is hopefully not a problem to include 
them in Tika.
The code is an attempt to fully implement the algorithm outlined in [MS-DOC], 
v20140721, 2.4.6.3 + 2.4.6.4.
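
For readers wondering what the two helpers amount to, this is an independent 
sketch, not the code from the blog post or from the attachment. The class name 
is hypothetical, and the letter scheme assumes Word's convention of repeating 
the letter after "z" (aa, bb, cc, ...), which should be checked against real 
documents.

```java
/** Independent illustration; not the helpers from the blog post or the patch. */
final class NumberStyleSketch {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] SYMBOLS =
            {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};

    /** Converts a positive number to upper-case Roman numerals. */
    static String intToRoman(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive: " + n);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {
                sb.append(SYMBOLS[i]);
                n -= VALUES[i];
            }
        }
        return sb.toString();
    }

    /** Converts 1..26 to a..z, then repeats the letter: 27 is aa, 28 is bb (assumed). */
    static String intToLetter(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive: " + n);
        }
        char letter = (char) ('a' + (n - 1) % 26);
        int repeats = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }
}
```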

The downside of my approach is that it IMHO externalizes quite a bit of 
functionality which should actually be inside POI. Since those 
ListLevelOverrides can also influence the overall formatting of the paragraph 
(something which is handled by POI) this can lead to inconsistent behaviour.

The current testcase (WordParserTest.java) has rather poor coverage of the 
proposed new algorithm. I have a better test file here which reaches about 80% 
(the rest being mostly error handling stuff). Give me a shout if you want that 
to be included in tika as well.

Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to 
POI before using this.

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.7
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2014-05-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013974#comment-14013974
 ] 

Nick Burch commented on TIKA-1315:
--

In ListUtils, it might be safest to make UNORDERED_LIST_CHAR use the Unicode 
escape sequence for that character, and maybe initialise the map up front.

ListUtils seems to largely be based on someone else's work - do you have their 
OK to contribute it to Tika? (We can only accept code that is either already 
under an appropriate license, or willingly contributed by the author)

The unit test looks a little slim - any chance of something that checks in a 
bit more detail? e.g. explicit entry checks, HTML level checks, etc.?

Can we do the same for XWPF / .docx files using the same / similar logic?
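
The first suggestion above could look roughly like the sketch below. Only the 
constant name UNORDERED_LIST_CHAR comes from the comment. The class name, the 
choice of U+2022 as the bullet character, and the contents of the map are 
assumptions made for illustration, since the attached ListUtils.java is not 
reproduced in this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the real ListUtils attachment may differ. */
final class ListUtilsSketch {
    // Written as a Unicode escape so the source file encoding cannot corrupt it.
    // U+2022 (bullet) is an assumed value.
    static final char UNORDERED_LIST_CHAR = '\u2022';

    // Illustrative lookup table, filled once when the class is loaded
    // instead of lazily on first use.
    static final Map<Integer, String> NUMBER_STYLE_NAMES;

    static {
        Map<Integer, String> m = new HashMap<Integer, String>();
        m.put(0x00, "arabic");
        m.put(0x01, "upper-roman");
        m.put(0x02, "lower-roman");
        m.put(0x03, "upper-letter");
        m.put(0x04, "lower-letter");
        m.put(0x17, "bullet");
        m.put(0xFF, "none");
        NUMBER_STYLE_NAMES = Collections.unmodifiableMap(m);
    }
}
```

Filling the map in a static initialiser and wrapping it as unmodifiable keeps 
it safe to share between parser instances without extra synchronisation.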

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.6
>
> Attachments: ListUtils.java, WordExtractor.java.patch, 
> WordParserTest.java.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2014-05-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013919#comment-14013919
 ] 

Filip Bednárik commented on TIKA-1315:
--

Sure, I updated the issue.

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.6
>
> Attachments: ListUtils.java, WordExtractor.java, 
> WordExtractor.java.patch, WordParserTest.java, WordParserTest.java.patch
>
>


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2014-05-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013873#comment-14013873
 ] 

Nick Burch commented on TIKA-1315:
--

Any chance you could post a patch of your changes, rather than complete changed 
files?

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.6
>
> Attachments: ListUtils.java, WordExtractor.java, WordParserTest.java
>
>