[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562697#comment-14562697
 ] 

Moritz Dorka commented on TIKA-1315:


The case where the current level has no numberText (i.e. its nfc equals 
0xFF or 0x17) isn't really well specified by Microsoft in their binary *.doc 
specification; hence this 5th test. I don't know whether that has changed with 
the new Ecma/ISO XML specs.

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
 complex_list_test.doc


 Hello guys, I am really sorry to post an issue like this, but I have no other 
 way of contacting you and I don't quite understand how you manage forks and 
 pull requests (I don't think you do that). Plus, I don't know your coding 
 style.
 In my project I needed TIKA to parse numbered lists from Word .doc 
 documents, but TIKA doesn't support that. So I looked for a solution and found 
 one here: 
 http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
  . So I adapted this solution to Apache TIKA with a few fixes and improvements. 
 Anyway, feel free to use any of it so that it can help people who struggle with 
 lists in TIKA like I did.
 Attached files are:
 Updated test
 Fixed WordExtractor
 Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1315:
---
Comment: was deleted

(was: The case of none for the numberText of the current level (i.e. its nfc 
equals 0xFF or 0x17) isn't really well specified by Microsoft in their binary 
*.doc specification. Hence, this 5th test. Don't know if that has changed with 
the new Ecma/ISO XML specs.)

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
 complex_list_test.doc




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-27 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560750#comment-14560750
 ] 

Moritz Dorka commented on TIKA-1315:


bq. For test 2, how did you get 1.b.III?

It's been quite a while since I authored that file. But at first glance I 
suppose this happens because the restartLim is set not to the first (the 
ordinary case) but to the second most-significant ilvl. This means the third 
level is only reset each time an item belonging to the first level occurs. 
Since 1.b, which precedes the element in question, belongs to the second 
level, no such reset happens and the II from 1.a.II gets incremented instead.
See the definition of 
[ilvlRestartLim|https://msdn.microsoft.com/en-us/library/dd923594%28v=office.12%29.aspx]
 for a more complicated explanation.

I do not know Word's XML format, but assuming the logic hasn't changed, the 
above would mean you should encounter an ilvlRestartLim of 1 (not 3!) 
somewhere, associated with the currently applicable lvl (which may come from 
an override).
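
The reset behaviour described above can be illustrated with a small, simplified model. This is only a sketch and not Word's or POI's actual logic: the class {{RestartLimSketch}}, its fields and the zero-based reading of the restart limit (a deeper level restarts only after an item whose level index is smaller than that limit) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of multilevel list counters. This is NOT Word's or POI's
// code; class, field and method names are hypothetical.
class RestartLimSketch {
    private final int[] counters;   // current number per zero-based level
    private final int[] restartLim; // assumed restart limit per level

    RestartLimSketch(int[] restartLim) {
        this.restartLim = restartLim.clone();
        this.counters = new int[restartLim.length];
    }

    // Registers one list item at the given level and returns its label.
    String next(int level) {
        counters[level]++;
        // Assumption: a deeper level d restarts only when the current item's
        // level is more significant (smaller) than d's restart limit.
        for (int d = level + 1; d < counters.length; d++) {
            if (level < restartLim[d]) {
                counters[d] = 0;
            }
        }
        StringBuilder label = new StringBuilder();
        for (int l = 0; l <= level; l++) {
            if (l > 0) {
                label.append('.');
            }
            label.append(format(l, counters[l]));
        }
        return label.toString();
    }

    // Level 0: decimal, level 1: lower-case letter (1..26 only),
    // deeper levels: upper-case Roman numerals.
    private static String format(int level, int n) {
        if (level == 0) {
            return Integer.toString(n);
        }
        if (level == 1) {
            return String.valueOf((char) ('a' + n - 1));
        }
        return roman(n);
    }

    private static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // Labels a whole sequence of item levels.
    static List<String> labels(int[] restartLim, int... levels) {
        RestartLimSketch list = new RestartLimSketch(restartLim);
        List<String> out = new ArrayList<>();
        for (int level : levels) {
            out.add(list.next(level));
        }
        return out;
    }

    public static void main(String[] args) {
        int[] items = {0, 1, 2, 2, 1, 2};
        // Ordinary case: the third level restarts after any higher-level item.
        System.out.println(labels(new int[] {0, 1, 2}, items));
        // Third level only restarts after first-level items.
        System.out.println(labels(new int[] {0, 1, 1}, items));
    }
}
```

In this model the last item of the second run is labelled 1.b.III because the second-level item 1.b does not reset the third-level counter, whereas the ordinary setting yields 1.b.I.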

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
 complex_list_test.doc




[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor

2015-05-05 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1315:
---
Attachment: complex_list_test.doc

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
 complex_list_test.doc




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-05 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528359#comment-14528359
 ] 

Moritz Dorka commented on TIKA-1315:


[~talli...@mitre.org], I've attached [^complex_list_test.doc] to this bug. It 
exhibits _some_, but unfortunately not all, possible list numbering traps. A 
correct algorithm should compute the same value for both the numberText and the 
paragraph content.
However, I wasn't able to squeeze all possible circumstances under which Word 
creates those nasty {{ListFormatOverrideLevels}} into this test file. The 
general idea (and this might help you for .docx, which I do not have access to) 
is to create a list in Word, select a few entries, open the list's formatting 
properties, make some changes, go into the advanced formatting features (in 
Word 2003 that is a button in the lower right part of the dialog which expands 
the window) and apply those changes only to the current selection.

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
 complex_list_test.doc




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004
 ] 

Moritz Dorka commented on TIKA-1315:


Well, the original patch by Filip is essentially an 80% solution. Everything 
that I added is rather obscure functionality...

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042
 ] 

Moritz Dorka commented on TIKA-1315:


I believe I could speed up the process by finally writing a unit test for 
the POI part... I'm just having a hard time motivating myself to write unit 
tests for a few stupid getters.

What you could also do is hardcode 
{code}getLevelNumberingPlaceholderOffsets(){code} to always return 
{code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most 
(trivial) cases (however, I have not tested how my code reacts to such 
cheating).
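
To make the hardcoding idea more concrete, here is a rough sketch of how such offsets could be used. It assumes that each offset is the one-based position of a level placeholder in the level's number text and that the placeholder character's value is the zero-based level index; that is my reading of the binary format, not something confirmed by the patch. The class {{PlaceholderOffsetsSketch}} and the method {{expand()}} are hypothetical names.

```java
// Hypothetical sketch; not taken from the POI patch or from Tika.
class PlaceholderOffsetsSketch {
    // Stand-in for the hardcoded return value suggested above.
    static final int[] HARDCODED_OFFSETS = {1, 3, 5, 7, 9, 11, 13, 15, 17};

    // Replaces each placeholder (a char whose value is the level index) found
    // at a one-based offset with that level's already formatted number.
    static String expand(String template, int[] offsets, String[] numbers) {
        StringBuilder out = new StringBuilder();
        int copied = 0; // zero-based index of the next template char to copy
        for (int offset : offsets) {
            int pos = offset - 1;
            if (pos >= template.length()) {
                break; // remaining offsets point beyond this template
            }
            out.append(template, copied, pos); // literal text before the placeholder
            int level = template.charAt(pos);
            out.append(numbers[level]);
            copied = pos + 1;
        }
        out.append(template, copied, template.length()); // trailing literal text
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholders for levels 0, 1 and 2, each followed by a dot.
        String template = "\u0000.\u0001.\u0002.";
        String[] numbers = {"1", "b", "III"};
        System.out.println(expand(template, HARDCODED_OFFSETS, numbers));
    }
}
```

The hardcoded array only fits templates in which every placeholder is followed by exactly one literal character. A template such as "(" + placeholder + ")" keeps its placeholder at offset 2, so the hardcoded offsets would not match it, which is presumably why this only covers the trivial cases.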

There is also a very subtle bug left in my code which only triggers in 
ListLevelOverrides and _sometimes_ provokes wrong number increments. If I find 
the time I will update my patch.

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch




[jira] [Updated] (TIKA-1468) Symbol character handling in WordExtractor

2014-11-15 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1468:
---
Attachment: WordParserTest.patch
testWORD_specialcharacters.tar.bz2

Requested JUnit test case

 Symbol character handling in WordExtractor
 --

 Key: TIKA-1468
 URL: https://issues.apache.org/jira/browse/TIKA-1468
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Moritz Dorka
Priority: Minor
 Attachments: WordExtractor.patch, WordParserTest.patch, 
 testWORD_specialcharacters.tar.bz2


 Attached is a patch to allow for proper handling of _symbol characters_ in 
 *.doc files (i.e. stuff which can be inserted via Insert-Symbol in Word).
 Side note: I am a little unsure where exactly the boundary between the scope 
 of TIKA and POI lies here. Theoretically, one could add that patch to 
 {{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument,
  CharacterRun, Element)}} as well.





[jira] [Commented] (TIKA-1468) Symbol character handling in WordExtractor

2014-11-15 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213535#comment-14213535
 ] 

Moritz Dorka commented on TIKA-1468:


So here is a JUnit test case which relies on the special handling of characters 
from the Symbol font. The Microsoft specs talk about a case where these 
special characters already come in their Unicode representation (thus 
triggering the fallback in [^WordExtractor.patch]). However, I have no idea how 
to create a Word file that actually shows this behavior...
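
The two cases mentioned above (a character that still needs translating from the Symbol font versus one that already is Unicode) could look roughly like the following sketch. It is not the logic of the attached patch: the class {{SymbolCharSketch}}, the assumption that symbol characters arrive shifted into the range U+F020..U+F0FF, and the tiny three-entry mapping table are all illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the logic of WordExtractor.patch.
class SymbolCharSketch {
    // Tiny, deliberately incomplete excerpt of a Symbol-font-to-Unicode table.
    private static final Map<Integer, Character> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, '\u03B1'); // 'a' in Symbol -> Greek alpha
        SYMBOL_TO_UNICODE.put(0x62, '\u03B2'); // 'b' in Symbol -> Greek beta
        SYMBOL_TO_UNICODE.put(0x70, '\u03C0'); // 'p' in Symbol -> Greek pi
    }

    // Assumption: symbol characters arrive shifted into U+F020..U+F0FF.
    static char toUnicode(char c) {
        if (c >= 0xF020 && c <= 0xF0FF) {
            Character mapped = SYMBOL_TO_UNICODE.get(c - 0xF000);
            if (mapped != null) {
                return mapped;
            }
        }
        // Fallback: treat the character as already being proper Unicode.
        return c;
    }

    public static void main(String[] args) {
        System.out.println(toUnicode((char) 0xF061)); // shifted Symbol 'a'
        System.out.println(toUnicode('\u03B1'));      // already Unicode, unchanged
    }
}
```

A real table would have to cover the whole Symbol font, and characters outside the table are simply passed through here.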

Regarding the location of logic: Does TIKA actually make use of POI's 
{{AbstractWordConverter}}?


 Symbol character handling in WordExtractor
 --

 Key: TIKA-1468
 URL: https://issues.apache.org/jira/browse/TIKA-1468
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Moritz Dorka
Priority: Minor
 Attachments: WordExtractor.patch, WordParserTest.patch, 
 testWORD_specialcharacters.tar.bz2




[jira] [Created] (TIKA-1468) Symbol character handling in WordExtractor

2014-11-09 Thread Moritz Dorka (JIRA)
Moritz Dorka created TIKA-1468:
--

 Summary: Symbol character handling in WordExtractor
 Key: TIKA-1468
 URL: https://issues.apache.org/jira/browse/TIKA-1468
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Moritz Dorka
Priority: Minor


Attached is a patch to allow for proper handling of _symbol characters_ in 
*.doc files (i.e. stuff which can be inserted via Insert-Symbol in Word).

Side note: I am a little unsure where exactly the boundary between the scope of 
TIKA and POI lies here. Theoretically, one could add that patch to 
{{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument,
 CharacterRun, Element)}} as well.





[jira] [Updated] (TIKA-1468) Symbol character handling in WordExtractor

2014-11-09 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1468:
---
Attachment: WordExtractor.patch

 Symbol character handling in WordExtractor
 --

 Key: TIKA-1468
 URL: https://issues.apache.org/jira/browse/TIKA-1468
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Moritz Dorka
Priority: Minor
 Attachments: WordExtractor.patch




[jira] [Comment Edited] (TIKA-1315) Basic list support in WordExtractor

2014-09-22 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495
 ] 

Moritz Dorka edited comment on TIKA-1315 at 9/22/14 8:14 AM:
-

Hmm, apparently, files are global to a bug in Jira and are not linked to 
specific comments... Too bad. So this is related to [^ListManager.tar.bz2] and 
[^ListNumbering.patch] which I propose as substitutes for Filip's work.


\\
The original patch proposed by Filip is quite good but
* it lacks true support for ListFormatOverrideLevels (which, admittedly, is a 
really brain-twisting feature of Word),
* it does not cope correctly with bullets / unnumbered items (i.e. stuff which 
has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists,
* there is no support for legal formatting, and
* there is no support for levels which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm written from 
scratch, with the exception of two helper methods ({{intToRoman()}} + 
{{intToLetter()}}) which are still based on the original blog post cited by 
Filip. I consider them rather trivial, so it is hopefully not a problem to 
include them in tika.
The code is an attempt to fully implement the algorithm outlined in MS-DOC, 
v20140721, 
[2.4.6.3|http://msdn.microsoft.com/en-us/library/dd921056%28v=office.12%29.aspx]
 + 
[2.4.6.4|http://msdn.microsoft.com/en-us/library/dd945275%28v=office.12%29.aspx].
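
For orientation, the two helpers and the handling of unnumbered levels might look like the sketch below. Only the helper names {{intToRoman()}} and {{intToLetter()}} and the nfc values 0x17 and 0xFF come from this comment; the method bodies, the class name {{NfcFormatSketch}}, the nfc codes 0x00 to 0x04 and the "aa, bb, ..." continuation after "z" are assumptions and are not taken from the attached files.

```java
import java.util.Locale;

// Hypothetical sketch; the bodies are not taken from the attached patch.
class NfcFormatSketch {
    static final int NFC_ARABIC = 0x00;       // assumed code for decimal numbers
    static final int NFC_UPPER_ROMAN = 0x01;  // assumed
    static final int NFC_LOWER_ROMAN = 0x02;  // assumed
    static final int NFC_UPPER_LETTER = 0x03; // assumed
    static final int NFC_LOWER_LETTER = 0x04; // assumed
    static final int NFC_BULLET = 0x17;       // bullet, as named in the comment
    static final int NFC_NONE = 0xFF;         // no number, as named in the comment

    // Upper-case Roman numeral for n >= 1.
    static String intToRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // Lower-case letter for n >= 1. Assumption: after "z" the letter is
    // repeated (aa, bb, ...) rather than counted in base 26.
    static String intToLetter(int n) {
        char c = (char) ('a' + (n - 1) % 26);
        int repeat = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeat; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Bullets and "none" levels contribute no number text.
    static boolean isUnnumbered(int nfc) {
        return nfc == NFC_BULLET || nfc == NFC_NONE;
    }

    static String format(int nfc, int n) {
        if (isUnnumbered(nfc)) {
            return "";
        }
        switch (nfc) {
            case NFC_UPPER_ROMAN:
                return intToRoman(n);
            case NFC_LOWER_ROMAN:
                return intToRoman(n).toLowerCase(Locale.ROOT);
            case NFC_UPPER_LETTER:
                return intToLetter(n).toUpperCase(Locale.ROOT);
            case NFC_LOWER_LETTER:
                return intToLetter(n);
            default:
                return Integer.toString(n); // decimal, also used as a fallback
        }
    }
}
```

The point of {{isUnnumbered()}} is that such a level must not inject any number text into the label of a multilevel list; how its counter should behave is exactly the part the specification leaves vague.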

The downside of my approach is that it IMHO externalizes quite a bit of 
functionality which should actually be inside POI. Since those 
ListLevelOverrides can also influence the overall formatting of the paragraph 
(something which is handled by POI) this can lead to inconsistent behaviour.

The current test case ({{WordParserTest.java}}) has rather poor coverage for 
the proposed new algorithm. I have a better test file here which reaches about 
80% (the rest being mostly error handling stuff). Give me a shout if you want 
that to be included in tika as well.

Make sure to apply [this 
patch|https://issues.apache.org/bugzilla/show_bug.cgi?id=56998] to POI before 
using this.


 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.7

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch



[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor

2014-09-21 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1315:
---
Attachment: ListNumbering.patch
ListManager.tar.bz2

File paths are relative to the tika-parsers subproject

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.7

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch




[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2014-09-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495
 ] 

Moritz Dorka commented on TIKA-1315:


Hmm, apparently files are global to a bug in Jira and are not linked to 
specific comments... Too bad. So this is related to ListManager.tar.bz2 and 
ListNumbering.patch which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good but it lacks true support 
for ListFormatOverrideLevels (which, allegedly, is a really brain-twisting 
feature of Word), it does not cope correctly with bullets / unnumbered items 
(i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of 
multilevel lists and there is no support for either legal formatting or levels 
which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm written from 
scratch, with the exception of two helper methods (intToRoman() + 
intToLetter()) which are still based on the original blog post cited by Filip. 
I consider them rather trivial, so it is hopefully not a problem to include 
them in tika.
The code is an attempt to fully implement the algorithm outlined in [MS-DOC], 
v20140721, 2.4.6.3 + 2.4.6.4.

The downside of my approach is that it IMHO externalizes quite a bit of 
functionality which should actually be inside POI. Since those 
ListLevelOverrides can also influence the overall formatting of the paragraph 
(something which is handled by POI) this can lead to inconsistent behaviour.

The current test case (WordParserTest.java) has rather poor coverage for the 
proposed new algorithm. I have a better test file here which reaches about 80% 
(the rest being mostly error handling stuff). Give me a shout if you want that 
to be included in tika as well.

Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to 
POI before using this.

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.7

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch

