[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code

2015-05-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564115#comment-14564115
 ] 

Chris A. Mattmann commented on TIKA-1634:
-

[~ji-hyun...@jpl.nasa.gov] please don't close the issue until you've submitted a 
patch that adds the magic that makes ALL tests pass locally. Can you submit a 
patch for the magics that we worked on?

> Detecting problem with Matlab source code
> -
>
> Key: TIKA-1634
> URL: https://issues.apache.org/jira/browse/TIKA-1634
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.8
> Reporter: Ji-Hyun Oh
> Priority: Trivial
> Attachments: BARCAST_MainCode.m, wtsgaus.m
>
>
> Both Matlab source code and Objective-C source code have the same suffix, 
> which is .m. Therefore, Matlab has an additional match value in 
> tika-mimetypes.xml. In tika-mimetypes.xml Matlab is defined as:
>   
> <_comment>Matlab source code
> 
>   
> 
> 
> 
>   
> However, Matlab code does not always start with "function [". Therefore, 
> some Matlab files are detected as text/x-objcsrc. Based on the source code 
> collected from NOAA Paleoclimatology Software Resources, many Matlab files 
> would need match values like these (problematic files are attached as examples):
> 
> <_comment>Matlab source code
> 
>   
>   
> 
> 
> 
>   
> I conducted several detection tests using different Matlab packages obtained 
> from NOAA Paleoclimatology Software Resources, with and without 
> custom-mimetypes.xml. Results are attached. As a result, a total of 103 Matlab 
> files are detected correctly with custom-mimetypes.xml, while 42 Matlab 
> files are detected as Matlab files without custom-mimetypes.xml (i.e. with 
> only the current match value). However, these match values for Matlab source 
> code may be common only in the Paleoclimatology community. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TIKA-1642) Integrate cTAKES into Tika

2015-05-28 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro reassigned TIKA-1642:
-

Assignee: Giuseppe Totaro

> Integrate cTAKES into Tika
> --
>
> Key: TIKA-1642
> URL: https://issues.apache.org/jira/browse/TIKA-1642
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Selina Chu
> Assignee: Giuseppe Totaro
>
> [~gostep] has written a preliminary version of 
> [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
> to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika.
> The CTAKESContentHandler allows the following steps to be performed in Tika:
> * create an AnalysisEngine based on a given XML descriptor;
> * create a CAS (Common Analysis System) appropriate for this AnalysisEngine;
> * populate the CAS with the text extracted by using Tika;
> * run the AnalysisEngine against the plain text added to the CAS;
> * write out the results in the given format (XML, XCAS, XMI, etc.).
> It would be a great improvement if we could parse the output of cTAKES and 
> create a list of metadata that describes the terms found in the annotation 
> index and their corresponding tokens. For instance, using the 
> AggregatePlaintextFastUMLSProcessor analysis engine, we can utilize the UMLS 
> database to obtain the annotations related to DiseaseDisorderMention, and I 
> would like to be able to produce a list of the words in the input text that 
> are annotated as DiseaseDisorderMention.





[jira] [Commented] (TIKA-1642) Integrate cTAKES into Tika

2015-05-28 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563993#comment-14563993
 ] 

Giuseppe Totaro commented on TIKA-1642:
---

Hi [~selina], I believe that is a great idea. I am going to update my code on 
GitHub right now and add support for cTAKES metadata, as you suggested.
Then I will post a new patch for Tika here.
Thanks a lot,
Giuseppe






[jira] [Created] (TIKA-1642) Integrate cTAKES into Tika

2015-05-28 Thread Selina Chu (JIRA)
Selina Chu created TIKA-1642:


 Summary: Integrate cTAKES into Tika
 Key: TIKA-1642
 URL: https://issues.apache.org/jira/browse/TIKA-1642
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Selina Chu


[~gostep] has written a preliminary version of 
[CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] to 
integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika.

The CTAKESContentHandler allows the following steps to be performed in Tika:

* create an AnalysisEngine based on a given XML descriptor;
* create a CAS (Common Analysis System) appropriate for this AnalysisEngine;
* populate the CAS with the text extracted by using Tika;
* run the AnalysisEngine against the plain text added to the CAS;
* write out the results in the given format (XML, XCAS, XMI, etc.).

It would be a great improvement if we could parse the output of cTAKES and 
create a list of metadata that describes the terms found in the annotation 
index and their corresponding tokens. For instance, using the 
AggregatePlaintextFastUMLSProcessor analysis engine, we can utilize the UMLS 
database to obtain the annotations related to DiseaseDisorderMention, and I 
would like to be able to produce a list of the words in the input text that 
are annotated as DiseaseDisorderMention.
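
The metadata idea in the last paragraph can be illustrated without cTAKES 
itself. The following is only a minimal sketch in plain Java, assuming that 
annotations are available as (type, begin, end) character offsets into the 
extracted text. The Span class, the collectTerms method, and the sample text 
and offsets are hypothetical stand-ins, not part of the cTAKES or UIMA API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AnnotationMetadataSketch {

    /** Hypothetical stand-in for an annotation: a type name plus character offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Groups the text covered by each span under its annotation type, in input order. */
    static Map<String, List<String>> collectTerms(String text, List<Span> spans) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (Span span : spans) {
            String covered = text.substring(span.begin, span.end);
            metadata.computeIfAbsent(span.type, k -> new ArrayList<>()).add(covered);
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Made-up sample input; the offsets were chosen by hand for this sentence.
        String text = "Patient reports asthma and chronic migraine.";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("DiseaseDisorderMention", 16, 22));
        spans.add(new Span("DiseaseDisorderMention", 27, 43));
        System.out.println(collectTerms(text, spans));
    }
}
```

In a real integration the resulting map would presumably be copied into Tika 
metadata, one multi-valued entry per annotation type; that step is left out 
here because it depends on how the handler exposes its results.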





[jira] [Closed] (TIKA-1634) Detecting problem with Matlab source code

2015-05-28 Thread Ji-Hyun Oh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji-Hyun Oh closed TIKA-1634.

Resolution: Fixed






[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code

2015-05-28 Thread Ji-Hyun Oh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563642#comment-14563642
 ] 

Ji-Hyun Oh commented on TIKA-1634:
--

I tested the newly updated magics with my set of Matlab files. With the updated 
magics, only one file failed to be detected as Matlab (see the updated .xls file 
for the result). The file started with:

%% SET the initial values for the Bayesian Anova.
load Data_INPUT

So we also added one more match value as below: 


However, I am closing my issue. 
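
To illustrate why a single leading-string magic is not enough for .m files, 
here is a minimal sketch of prefix-based matching in plain Java. It does not 
use Tika's detector. The class, the looksLikeMatlab method, and the prefix 
list are assumptions made for illustration: "function [" is the value named 
in the issue description, and "%%" is a guess based on the failing file quoted 
above, not the match value from the actual patch.

```java
import java.util.Arrays;
import java.util.List;

public class MatlabPrefixSketch {

    // Illustrative prefixes only: "function [" is mentioned in the issue;
    // "%%" is an assumed extra value based on the file that failed detection.
    static final List<String> PREFIXES = Arrays.asList("function [", "%%");

    /** Returns true if the content begins with one of the prefixes at offset 0. */
    static boolean looksLikeMatlab(String content) {
        for (String prefix : PREFIXES) {
            if (content.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeMatlab("function [a, b] = f(x)\n"));
        System.out.println(looksLikeMatlab(
                "%% SET the initial values for the Bayesian Anova.\nload Data_INPUT\n"));
        System.out.println(looksLikeMatlab("#import <Foundation/Foundation.h>\n"));
    }
}
```

A prefix as short as "%%" could also occur at the start of other text 
formats, so a real magic would probably need a lower priority or additional 
conditions to avoid false positives.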









[jira] [Updated] (TIKA-1634) Detecting problem with Matlab source code

2015-05-28 Thread Ji-Hyun Oh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji-Hyun Oh updated TIKA-1634:
-
Description: 
Both Matlab source code and Objective-C source code have the same suffix, which 
is .m. Therefore, Matlab has an additional match value in tika-mimetypes.xml. 

In tika-mimetypes.xml Matlab is defined as:

  
<_comment>Matlab source code

  



  

However, Matlab code does not always start with "function [". Therefore, some 
Matlab files are detected as text/x-objcsrc. Based on the source code 
collected from NOAA Paleoclimatology Software Resources, many Matlab files 
would need match values like these (problematic files are attached as examples):


<_comment>Matlab source code

  
  



  

I conducted several detection tests using different Matlab packages obtained 
from NOAA Paleoclimatology Software Resources, with and without 
custom-mimetypes.xml. Results are attached. As a result, a total of 103 Matlab 
files are detected correctly with custom-mimetypes.xml, while 42 Matlab files 
are detected as Matlab files without custom-mimetypes.xml (i.e. with only the 
current match value). However, these match values for Matlab source code may 
be common only in the Paleoclimatology community. 



  was:
Both Matlab source code and Objective-C source code have the same suffix, which 
is .m. Therefore, Matlab has an additional match value in tika-mimetypes.xml. 

In tika-mimetypes.xml Matlab is defined as:

  
<_comment>Matlab source code

  



  

However, Matlab code does not always start with "function [". Therefore, some 
Matlab files are detected as text/x-objcsrc. Based on the source code 
collected from NOAA Paleoclimatology Software Resources, many Matlab files 
would need match values like these (problematic files are attached as examples):


<_comment>Matlab source code

  
  



  

I conducted several detection tests using different Matlab packages obtained 
from NOAA Paleoclimatology Software Resources, with and without 
custom-mimetypes.xml. Results are attached. As a result, a total of 121 Matlab 
files are detected correctly with custom-mimetypes.xml, while 55 Matlab files 
are detected as Matlab files without custom-mimetypes.xml (i.e. with only the 
current match value). However, these match values for Matlab source code may 
be common only in the Paleoclimatology community. 









[jira] [Updated] (TIKA-1634) Detecting problem with Matlab source code

2015-05-28 Thread Ji-Hyun Oh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji-Hyun Oh updated TIKA-1634:
-
Attachment: (was: Matlab_mime-type_test.xlsx)






[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563501#comment-14563501
 ] 

Tim Allison commented on TIKA-1315:
---

I added the relevant part of [~morido]'s test document to POI via POI-57889 so 
that we'll actually be able to handle overrides in docx when the next version 
of POI is released.

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Filip Bednárik
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch, 
> complex_list_test.doc
>
>
> Hello guys, I am really sorry to post an issue like this, but I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus, I don't know your coding 
> styles and stuff.
> In my project I needed Tika to parse numbered lists from Word .doc 
> documents, but Tika doesn't support it. So I looked for a solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . I adapted this solution to Apache Tika with a few fixes and improvements. 
> Anyway, feel free to use any of it so it can help people who struggle with 
> lists in Tika like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils





[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563419#comment-14563419
 ] 

Hudson commented on TIKA-1315:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #715 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/715/])
TIKA-1315 -- basic list support for WordExtractor; still need to add in 
override behavior once we add a class to ooxml via POI (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1682287)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_numbered_list.doc
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_numbered_list.docx
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_override_list_numbering.doc
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testWORD_override_list_numbering.docx







[jira] [Resolved] (TIKA-1641) Extra whitespace produced while extracting bodycontent in tika gui

2015-05-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1641.
---
Resolution: Won't Fix

Ah, ok.  Thank you.  Correctly recombining terms across line breaks in PDFs is 
beyond the scope of Tika.  You might want to send a note to the pdfbox users 
list and ask what strategies they'd recommend.

> Extra whitespace produced while extracting bodycontent in tika gui
> --
>
> Key: TIKA-1641
> URL: https://issues.apache.org/jira/browse/TIKA-1641
> Project: Tika
> Issue Type: Bug
> Components: gui, handler
> Affects Versions: 1.6
> Reporter: cheehoo
> Attachments: File1.pdf, test_ws_tika_pdfbox.txt, 
> test_ws_tika_tika.txt, tika-whitespace.png
>
>
> PDF import into tika gui added extra whitespace/newline in the main content.





[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563302#comment-14563302
 ] 

Tim Allison commented on TIKA-1315:
---

With many thanks to [~drndos] and [~morido], I've added _basic_ list support 
for doc and docx files in r1682287.  [~morido], your patch and test doc were 
crucial. 

We have to add an ooxml class via POI before we can make overrides work for 
docx.  I've coded that chunk and commented it out, and there's a test case 
that is also commented out (again, thanks to [~morido]'s test doc).  I won't 
close this issue until that is updated.

I have no doubt that further work remains, but this should be a good start.

[~gullbyrd], many apologies for the amount of time this took.  Please build 
from trunk and run against your docs to see if this meets your needs.







[jira] [Commented] (TIKA-1641) Extra whitespace produced while extracting bodycontent in tika gui

2015-05-28 Thread cheehoo (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563085#comment-14563085
 ] 

cheehoo commented on TIKA-1641:
---

The newline/extra whitespace that causes a problem for me is in CAPITALIZED.EMA 
i...@address.com, as shown in tika-whitespace.png. The original value is 
supposed to be capitalized.em...@address.com; however, after being parsed by 
Tika it appears as CAPITALIZED.EMA i...@address.com in the main content. I have 
a program that checks email addresses with a regex, so the extra 
newline/whitespace inside the address causes the pattern match to fail. 
This problem also occurs in some of the domain names in the file (e.g. 
http://WWW.DOMAIN17. COM ).
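
One possible workaround on the consumer side, sketched below in plain Java, is 
to strip whitespace from a candidate string before applying the email regex. 
The pattern, the class and method names, and the sample strings are all 
assumptions for illustration. Joining text across whitespace can also create 
false matches, so this is a heuristic rather than a fix for the extraction 
itself.

```java
import java.util.regex.Pattern;

public class WhitespaceTolerantEmail {

    // Deliberately simple email pattern, for illustration only.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Removes all whitespace from the candidate, then tests it against EMAIL. */
    static boolean isEmailIgnoringWhitespace(String candidate) {
        String compact = candidate.replaceAll("\\s+", "");
        return EMAIL.matcher(compact).matches();
    }

    public static void main(String[] args) {
        // Made-up addresses; the first has a line break inside the local part.
        System.out.println(isEmailIgnoringWhitespace("JOHN.DO\nE@EXAMPLE.COM"));
        System.out.println(isEmailIgnoringWhitespace("JOHN.DOE@EXAMPLE.COM"));
        System.out.println(isEmailIgnoringWhitespace("not an email"));
    }
}
```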






[jira] [Updated] (TIKA-1641) Extra whitespace produced while extracting bodycontent in tika gui

2015-05-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1641:
--
Attachment: test_ws_tika_tika.txt
test_ws_tika_pdfbox.txt

I'm attaching a straight ExtractText dump from PDFBox's app as well as 
tika-app's output with -t.

Yes, there are a few more newlines between
{noformat}
asd fasd   12.23.34.45 
{noformat}

and
{noformat}
magna aliqua. Ut 
{noformat}

And there are more newlines at the end of the file.

Which newlines are causing problems for you, and do they change the meaning of 
the document or is this a problem with rendering?






[jira] [Issue Comment Deleted] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1315:
---
Comment: was deleted

(was: The case of "none" for the numberText of the current level (i.e. its nfc 
equals 0xFF or 0x17) isn't really well specified by Microsoft in their binary 
*.doc specification. Hence, this 5th test. Don't know if that has changed with 
the new Ecma/ISO XML specs.)






[jira] [Issue Comment Deleted] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Moritz Dorka (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moritz Dorka updated TIKA-1315:
---
Comment: was deleted

(was: The case of "none" for the numberText of the current level (i.e. its nfc 
equals 0xFF or 0x17) isn't really well specified by Microsoft in their binary 
*.doc specification. Hence, this 5th test. Don't know if that has changed with 
the new Ecma/ISO XML specs.)






[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-05-28 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562697#comment-14562697
 ] 

Moritz Dorka commented on TIKA-1315:


The case of "none" for the numberText of the current level (i.e. its nfc equals 
0xFF or 0x17) isn't really well specified by Microsoft in their binary *.doc 
specification. Hence, this 5th test. Don't know if that has changed with the 
new Ecma/ISO XML specs.
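
For context, a minimal sketch in plain Java of how a list level's number 
format code (nfc) might be turned into marker text. The class and method are 
hypothetical and only a handful of codes are covered. The mapping of 0x00, 
0x03 and 0x04 is an assumption based on a reading of the Microsoft number 
format enumeration, and treating both 0xFF and 0x17 as "no number text" 
follows the comment above rather than a definitive reading of the 
specification.

```java
public class NfcSketch {

    /** Formats a 1-based counter for a small, illustrative subset of nfc codes. */
    static String formatNumber(int nfc, int counter) {
        switch (nfc) {
            case 0x00:  // assumed: Arabic numerals
                return Integer.toString(counter);
            case 0x03:  // assumed: upper-case letters (A..Z only in this sketch)
                return String.valueOf((char) ('A' + (counter - 1) % 26));
            case 0x04:  // assumed: lower-case letters (a..z only in this sketch)
                return String.valueOf((char) ('a' + (counter - 1) % 26));
            case 0x17:  // treated as "none" here, per the comment above
            case 0xFF:
                return "";
            default:    // anything else falls back to Arabic in this sketch
                return Integer.toString(counter);
        }
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(0x00, 3));
        System.out.println(formatNumber(0x04, 2));
        System.out.println(formatNumber(0xFF, 5).isEmpty());
    }
}
```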




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1641) Extra whitespace produced while extracting bodycontent in tika gui

2015-05-28 Thread cheehoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cheehoo updated TIKA-1641:
--
Description: PDF import into tika gui added extra whitespace/newline in the 
main content.  (was: PDF import into tika gui however found that there's extra 
space being introduced in the main content.)

> Extra whitespace produced while extracting bodycontent in tika gui
> --
>
> Key: TIKA-1641
> URL: https://issues.apache.org/jira/browse/TIKA-1641
> Project: Tika
> Issue Type: Bug
> Components: gui, handler
> Affects Versions: 1.6
> Reporter: cheehoo
> Attachments: File1.pdf, tika-whitespace.png
>
>
> PDF import into tika gui added extra whitespace/newline in the main content.





[jira] [Updated] (TIKA-1641) Extra whitespace produced while extracting bodycontent in tika gui

2015-05-28 Thread cheehoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cheehoo updated TIKA-1641:
--
Attachment: File1.pdf
            tika-whitespace.png



[jira] [Created] (TIKA-1641) Extra whitespace produced while extracting bodycontent in tika gui

2015-05-28 Thread cheehoo (JIRA)

cheehoo created TIKA-1641:
-

 Summary: Extra whitespace produced while extracting bodycontent in 
tika gui
 Key: TIKA-1641
 URL: https://issues.apache.org/jira/browse/TIKA-1641
 Project: Tika
  Issue Type: Bug
  Components: gui, handler
Affects Versions: 1.6
Reporter: cheehoo


PDF import into tika gui however found that there's extra space being 
introduced in the main content.


