[jira] [Commented] (TIKA-1131) Output sentence-break "hints" for files such as PPT/X

2015-03-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375439#comment-14375439
 ] 

Shai Erera commented on TIKA-1131:
--

Hi [~tpalsulich], thanks for getting back to me, but I have since replaced my 
laptop and I don't have that sample file anymore. I can close the issue for now, 
and if I run into it again I'll report back. OK?

> Output sentence-break "hints" for files such as PPT/X
> -
>
> Key: TIKA-1131
> URL: https://issues.apache.org/jira/browse/TIKA-1131
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Shai Erera
>Priority: Minor
>
> Spinoff from here: http://tika.markmail.org/thread/xk5sclapbeonifzr. I 
> believe that usually these files contain text that does not end with the 
> usual sentence breaks. As I've shown in the email, the parser seems to detect 
> e.g. different bullets by inserting manual '\n' characters, but that's not 
> enough per the sentence segmentation rules of UAX#29.
> It would be better if the parser output a clearer marker which the user could 
> then replace with a true sentence break (e.g. \u2029), rather than 
> arbitrarily replacing every '\n', which I think is not a good general 
> solution.
> BTW, I parsed Impress files and it seems the parser does output some hints (I 
> think  tags).
> I'll upload an isolated test which generates the output as I put in the email.
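
The replacement step the description asks for could look roughly like the sketch below. It assumes the parser emits some explicit marker between text runs; the marker character and the class and method names are hypothetical and are not something Tika defines. Only the target character, U+2029 (PARAGRAPH SEPARATOR), comes from the description.

```java
/** Hypothetical post-processor: turns parser-emitted break hints into U+2029. */
final class SentenceBreakHints {
    /** Assumed marker (a private-use character); not something Tika emits. */
    static final String HINT = "\uE000";

    /** U+2029 PARAGRAPH SEPARATOR, which UAX #29 treats as a sentence boundary. */
    static final String PARAGRAPH_SEPARATOR = "\u2029";

    /** Replaces only the explicit hints; ordinary newline characters are left alone. */
    static String applyHints(String extracted) {
        return extracted.replace(HINT, PARAGRAPH_SEPARATOR);
    }

    public static void main(String[] args) {
        String slide = "First bullet" + HINT + "Second bullet\nstill the second bullet";
        String[] segments = applyHints(slide).split(PARAGRAPH_SEPARATOR);
        for (String segment : segments) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

With a dedicated marker, the caller decides what a break means and the newlines inside a bullet survive untouched, which is the point made above about not replacing every '\n'.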



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1580) ISA-Tab parsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1580:

Fix Version/s: 1.8

> ISA-Tab parsers
> ---
>
> Key: TIKA-1580
> URL: https://issues.apache.org/jira/browse/TIKA-1580
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: new-parser
> Fix For: 1.8
>
> Attachments: TIKA-1580.patch
>
>
> We are going to add parsers for ISA-Tab data formats.
> ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help 
> to manage an increasingly diverse set of life science, environmental, and 
> biomedical experiments that employ one or a combination of technologies.
> The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular 
> format. Therefore, the ISA-Tab data format includes three types of file: 
> Investigation file ({{i_.txt}}), Study file ({{s_.txt}}), Assay file 
> ({{a_.txt}}). These files are organized as a [top-down 
> hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
> file includes one or more Study files, and each Study file includes one or more 
> Assay files.
> Essentially, the Investigation file contains high-level information about 
> the related study, so it provides only metadata about ISA-Tab files.
> More details on the file format specification are [available 
> online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
> The attached patch provides a preliminary version of the ISA-Tab parsers 
> (there are three parsers, one for each ISA-Tab file type):
> * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts 
> only metadata.
> * {{ISATabStudyParser.java}}: parses Study files.
> * {{ISATabAssayParser.java}}: parses Assay files.
> The most important improvements still to be made are:
> * Combine these three parsers in order to parse an ISArchive
> * Provide a better mapping of both study and assay data to XHTML. Currently, 
> {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping 
> function relying on [Apache Commons 
> CSV|https://commons.apache.org/proper/commons-csv/].
> Thanks for supporting me on this work [~chrismattmann]. 
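
As an illustration of the three file types, here is a minimal sketch that tells them apart by file name alone. It assumes the prefix convention seen in the sample file names attached to the review request (i_investigation.txt, s_BII-S-1.txt, a_proteome.txt). The class, enum, and method names are hypothetical and are not part of the TIKA-1580 patch.

```java
import java.util.Locale;

/** Hypothetical helper, not part of the TIKA-1580 patch. */
final class IsaTabFileNames {
    enum Kind { INVESTIGATION, STUDY, ASSAY, UNKNOWN }

    /** Classifies a bare file name by its prefix (i_, s_ or a_) and .txt suffix. */
    static Kind classify(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (!name.endsWith(".txt")) {
            return Kind.UNKNOWN;
        }
        if (name.startsWith("i_")) {
            return Kind.INVESTIGATION;
        }
        if (name.startsWith("s_")) {
            return Kind.STUDY;
        }
        if (name.startsWith("a_")) {
            return Kind.ASSAY;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        String[] samples = {"i_investigation.txt", "s_BII-S-1.txt", "a_proteome.txt"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

A name-based check like this is only a first filter; a real detector would presumably also look at the content, since any tab-separated text file can carry such a name.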





[jira] [Commented] (TIKA-1580) ISA-Tab parsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375385#comment-14375385
 ] 

Chris A. Mattmann commented on TIKA-1580:
-

Great work [~gostep]. I think the main question is whether or not to include 
those example files, which seem to come from 
http://isatab.sourceforge.net/examples.html or from 
https://github.com/bobular/Bio-Parser-ISATab. Either one looks like GPL or 
http://isatab.sourceforge.net/licenses/ISAcreator-license.html. 

I recommend simply creating your own sample files and testing based on those.

Also, I think you forgot to include the parsers in 
https://reviews.apache.org/r/32291/. Thanks!



[jira] [Assigned] (TIKA-1580) ISA-Tab parsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1580:
---

Assignee: Chris A. Mattmann



Re: Review Request 32291: ISATab parsers (preliminary version)

2015-03-22 Thread Giuseppe Totaro

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32291/
---

(Updated March 23, 2015, 4:39 a.m.)


Review request for tika and Chris Mattmann.


Bugs: TIKA-1580
https://issues.apache.org/jira/browse/TIKA-1580


Repository: tika


Description
---

ISATab parsers. This preliminary solution provides three parsers, one for each 
ISA-Tab filetype (Investigation, Study, Assay).


Diffs
-

  trunk/tika-bundle/pom.xml 1667912
  trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 1667912
  trunk/tika-parsers/pom.xml 1667912
  trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 1667912
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java PRE-CREATION
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java PRE-CREATION
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt PRE-CREATION
  trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt PRE-CREATION

Diff: https://reviews.apache.org/r/32291/diff/


Testing
---

Tested on sample ISA-Tab files downloaded from 
http://www.isa-tools.org/format/examples/.


Thanks,

Giuseppe Totaro



Re: Review Request 32255: File type description to HDFParser

2015-03-22 Thread Ann Burgess

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32255/
---

(Updated March 23, 2015, 4:33 a.m.)


Review request for tika.


Bugs: TIKA-1578
https://issues.apache.org/jira/browse/TIKA-1578


Repository: tika


Description
---

Added a file type description to the HDFParser, as NetCDF4 files have the .nc 
extension but use the HDFParser.


Diffs
-

  trunk/tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java 1667844
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/hdf/HDFParserTest.java 1667844

Diff: https://reviews.apache.org/r/32255/diff/


Testing
---

Unit testing using HDFParserTest


Thanks,

Ann Burgess



Re: Review Request 32260: Add file type description to NetCDF parser

2015-03-22 Thread Ann Burgess

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32260/
---

(Updated March 23, 2015, 4:33 a.m.)


Review request for tika.


Bugs: TIKA-1579
https://issues.apache.org/jira/browse/TIKA-1579


Repository: tika


Description
---

Outputs filetype with metadata


Diffs
-

  
  trunk/tika-parsers/src/main/java/org/apache/tika/parser/netcdf/NetCDFParser.java 1667874
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/netcdf/NetCDFParserTest.java 1667874

Diff: https://reviews.apache.org/r/32260/diff/


Testing
---

Unit testing with NetCDFParserTest


Thanks,

Ann Burgess



Re: Review Request 32260: Add file type description to NetCDF parser

2015-03-22 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32260/#review77371
---

Ship it!

- Chris Mattmann


On March 19, 2015, 9:22 p.m., Ann Burgess wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32260/
> ---
> 
> (Updated March 19, 2015, 9:22 p.m.)
> 
> 
> Review request for tika.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> Outputs filetype with metadata
> 
> 
> Diffs
> -
> 
>   
>   trunk/tika-parsers/src/main/java/org/apache/tika/parser/netcdf/NetCDFParser.java 1667874
>   trunk/tika-parsers/src/test/java/org/apache/tika/parser/netcdf/NetCDFParserTest.java 1667874
> 
> Diff: https://reviews.apache.org/r/32260/diff/
> 
> 
> Testing
> ---
> 
> Unit testing with NetCDFParserTest
> 
> 
> Thanks,
> 
> Ann Burgess
> 
>



Re: Review Request 32255: File type description to HDFParser

2015-03-22 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32255/#review77370
---

Ship it!

- Chris Mattmann


On March 19, 2015, 7:45 p.m., Ann Burgess wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32255/
> ---
> 
> (Updated March 19, 2015, 7:45 p.m.)
> 
> 
> Review request for tika.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> Added a file type description to the HDFParser, as NetCDF4 files have the .nc 
> extension but use the HDFParser.
> 
> 
> Diffs
> -
> 
>   trunk/tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java 1667844
>   trunk/tika-parsers/src/test/java/org/apache/tika/parser/hdf/HDFParserTest.java 1667844
> 
> Diff: https://reviews.apache.org/r/32255/diff/
> 
> 
> Testing
> ---
> 
> Unit testing using HDFParserTest
> 
> 
> Thanks,
> 
> Ann Burgess
> 
>



[jira] [Updated] (TIKA-1578) Add file type description to HDFParsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1578:

Fix Version/s: 1.8

> Add file type description to HDFParsers
> ---
>
> Key: TIKA-1578
> URL: https://issues.apache.org/jira/browse/TIKA-1578
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7
>Reporter: Ann Burgess
>Assignee: Ann Burgess
>  Labels: parser
> Fix For: 1.8
>
> Attachments: TIKA-1578.abburgess.150319.patch.txt
>
>
> [~gostep] explains that there are three versions of NetCDF (classic format, 
> 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF 
> file, the netCDF library transparently detects its format, so we do not 
> need to adjust according to the detected format. 
> That said, it would be good to know the file type, as each can have the .nc 
> extension. This patch will add the file type to the metadata. 
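
To show why the .nc extension alone does not identify the format, here is a minimal sketch that looks at the leading bytes of a file instead. It is a standalone illustration, not how HDFParser or NetCDFParser determine the type, and the class and method names are hypothetical. The signatures used are the ones given in the netCDF and HDF5 format descriptions: "CDF" followed by version byte 1 (classic) or 2 (64-bit offset), and the eight-byte HDF5 signature. The sketch assumes the HDF5 signature sits at offset 0, and note that an HDF5 signature by itself does not prove the file was written through the netCDF-4 API.

```java
import java.util.Arrays;

/** Hypothetical sniffer; not the detection logic used by Tika. */
final class NetcdfFormatSniffer {
    /** The eight-byte signature that starts an HDF5 file. */
    private static final byte[] HDF5_SIGNATURE =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1A, '\n'};

    /** Describes the format suggested by the first bytes of a file. */
    static String describe(byte[] header) {
        if (header.length >= 4 && header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) {
                return "NetCDF classic";
            }
            if (header[3] == 2) {
                return "NetCDF 64-bit offset";
            }
        }
        if (header.length >= 8 && Arrays.equals(Arrays.copyOf(header, 8), HDF5_SIGNATURE)) {
            return "netCDF-4/HDF5";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe(new byte[] {'C', 'D', 'F', 2}));
    }
}
```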





[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2015-03-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1577:

Fix Version/s: 1.8

> NetCDF Data Extraction
> --
>
> Key: TIKA-1577
> URL: https://issues.apache.org/jira/browse/TIKA-1577
> Project: Tika
>  Issue Type: Improvement
>  Components: handler, parser
>Affects Versions: 1.7
>Reporter: Ann Burgess
>Assignee: Ann Burgess
>  Labels: features, handler
> Fix For: 1.8
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file 
> comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and 
> variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables 
> that don't have an unlimited dimension; and variable-size data, containing 
> the data for variables that have an unlimited dimension.
> The NetCDFParser currently extracts the "header part":
>  -- the text output contains the file's Dimensions and Variables
>  -- the metadata contains the Global Attributes
> We want the option to extract the "data part" of NetCDF files.
> Let's use the NetCDF test file for our dev testing:
> tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc





[jira] [Commented] (TIKA-1578) Add file type description to HDFParsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375367#comment-14375367
 ] 

Chris A. Mattmann commented on TIKA-1578:
-

Ship it!



[jira] [Updated] (TIKA-1578) Add file type description to HDFParsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1578:

Labels: parser  (was: )



[jira] [Updated] (TIKA-1578) Add file type description to HDFParsers

2015-03-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1578:

Affects Version/s: 1.7



Re: Hello!

2015-03-22 Thread Tyler Palsulich
Hi Ji-Hyun,

Great! Please let us know if you have any questions or advice for future
newcomers. :)

You might want to check out the contributors
 page for an intro to Tika's
development.

Have a good night,
Tyler



Re: Hello!

2015-03-22 Thread Mattmann, Chris A (3980)
Thanks Ji-Hyun welcome and please feel free to ask any
questions you may have! :)

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++









Hello!

2015-03-22 Thread Oh, Ji-Hyun (329F-Affiliate)
Dear all,

My name is Ji-Hyun Oh. I am a Post Doc working with Dr. Chris Mattmann.

To capture geoscience information, I am trying to get familiar with Tika.

Although I am currently taking very baby steps with Tika, I hope I will be able 
to contribute to Tika in near future.


Thanks!

Ji-Hyun



[jira] [Closed] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1460.
-
Resolution: Cannot Reproduce

Closing as Cannot Reproduce, since it's been a month since my last comment and 
we don't have the file which reproduces the issue. Please reopen if you're 
still running into this!

> Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'
> --
>
> Key: TIKA-1460
> URL: https://issues.apache.org/jira/browse/TIKA-1460
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: win7,myeclipse8.5
>Reporter: onyas
>Priority: Critical
>
> For some reason I could not upload the file, so here is the info.
> I also checked all the versions in the directory 
> \org\apache\pdfbox\resources\cmap and did not find the 'Adobe-GBK1-UCS2' file.
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d640af
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> Caused by: java.lang.IllegalArgumentException: Position 66048 past the end of the file
>   at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:50)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:202)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 21 more
> The main code is:
>   Parser parser = new AutoDetectParser();
>   ContentHandler handler = new BodyContentHandler(getNum());
>   Metadata metadata = new Metadata();
>   ParseContext context = new ParseContext();
>   StringBuffer content = new StringBuffer();
>   // try-with-resources closes the stream even if parse() throws
>   try (InputStream stream = new FileInputStream(file)) {
>       parser.parse(stream, handler, metadata, context);
>       // append(Object) uses handler.toString(), i.e. the extracted text
>       content.append(handler);
>       if (StringUtils.isNotBlank(content.toString())) {
>           hasContent = true;
>       }
>   }
> The exception is thrown at this line: parser.parse(stream, handler, 
> metadata, context);





Re: TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors

2015-03-22 Thread Tyler Palsulich
+1. Tika uses the external `gdalinfo` command to extract info from the
file. The unit tests first check `gdalinfo` is available -- if so, the
tests are run. But, it seems like in your case(s), your installation of
`gdalinfo` doesn't support the filetypes we test.
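
For illustration, an availability check of the kind described above can be sketched with the JDK alone. This is a rough sketch under stated assumptions, not the code Tika's tests use: the class and method names are hypothetical, the five-second timeout is arbitrary, and `Redirect.DISCARD` needs Java 9 or later.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not the check that Tika's GDAL tests perform. */
final class ExternalCommandCheck {
    /** Returns true if the command starts and exits with status 0 within five seconds. */
    static boolean isAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // Java 9+: drop the output
                    .start();
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Thrown when the program cannot be found or is not executable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("gdalinfo available: " + isAvailable("gdalinfo", "--version"));
    }
}
```

A check like this only shows that the binary runs. It says nothing about which formats the local build supports, which is the gap described above; `gdalinfo --formats` lists the drivers a given installation was built with.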

I don't think it would be correct for Tika to always ignore all results of
the GDAL tests. So, you can either not test (not a good idea), disable
those tests locally (and be careful not to check those changes into any
patches... annoying to do), move `gdalinfo` off of your PATH
(annoyingness-level depends on how you installed it and if you want to
actually get GDAL parses), uninstall `gdalinfo`, or reinstall gdal with the
complete flag (see the wiki).

None of these options are really ideal. But, I think reinstalling with the
complete flag is the best one.

I hope that helps.

Tyler

On Sun, Mar 22, 2015 at 12:35 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Agreed Seb, moving dev@nutch.a.o into BCC and moving this to
> the Tika list.
>
> -Original Message-
> From: Sebastian Nagel 
> Reply-To: "d...@nutch.apache.org" 
> Date: Sunday, March 22, 2015 at 4:32 AM
> To: "d...@nutch.apache.org" 
> Subject: Re: TestGDALParser.testParseBasicInfo and
> TestGDALParser.testParseMetadata errors
>
> >Hi,
> >
> >maybe this thread is better at dev@tika
> >since it's about building Tika.
> >
> >Btw., I can successfully build Tika trunk/1.8.
> >Looks like something system-specific, similar to TIKA-1503:
> >gdalinfo is installed, but fails to parse a certain file format.
> >
> >Thanks,
> >Sebastian
> >
> >On 03/22/2015 08:26 AM, Mohit Bagde wrote:
> >> Hi,
> >>
> >> I am also getting a similar error. Is this issue because of a prior
> >> installation of gdal or tesseract? I was using Tika 1.7, but 1.8 didn't
> >> work when I tried to build it. I did a clean svn checkout and then built
> >> it, but encountered similar errors as above.
> >>
> >> Is there a patch for this? Or has anyone found a fix for this?
> >>
> >> On Mar 21, 2015 10:24 PM, "Anvesha Sinha" wrote:
> >>
> >> Hi everyone,
> >>
> >> While installing TIKA, I am getting the following error:
> >>
> >> Tests run: 3, Failures: 2, Errors: 0, Skipped: 1, Time elapsed: 0.209 sec <<< FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
> >> testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.118 sec  <<< FAILURE!
> >> java.lang.AssertionError: null
> >> at org.junit.Assert.fail(Assert.java:86)
> >> at org.junit.Assert.assertTrue(Assert.java:41)
> >> at org.junit.Assert.assertNotNull(Assert.java:621)
> >> at org.junit.Assert.assertNotNull(Assert.java:631)
> >> at org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(TestGDALParser.java:70)
> >> testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.062 sec  <<< FAILURE!
> >> java.lang.AssertionError: null
> >> at org.junit.Assert.fail(Assert.java:86)
> >> at org.junit.Assert.assertTrue(Assert.java:41)
> >> at org.junit.Assert.assertNotNull(Assert.java:621)
> >> at org.junit.Assert.assertNotNull(Assert.java:631)
> >> at org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(TestGDALParser.java:111)
> >>
> >> Just to clarify, this error is not the same as
> >>
> >> testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.206 sec  <<< FAILURE!
> >> java.lang.AssertionError
> >> at org.junit.Assert.fail(Assert.java:86)
> >> at org.junit.Assert.assertTrue(Assert.java:41)
> >> at org.junit.Assert.assertNotNull(Assert.java:621)
> >> at org.junit.Assert.assertNotNull(Assert.java:631)
> >> at org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(TestGDALParser.java:153)
> >> which was rectified by tpalsulich in Revision 1647742. Any
> >> guidance/help would be appreciated.
> >>
> >> Thanks,
> >> Anvesha
> >> --
> >> Graduate Student (MS in Computer Science)
> >> University of Southern California
> >> /Phone: (+1) 213-308-9002 /
> >>
> >
>
>


[jira] [Commented] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-03-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375173#comment-14375173
 ] 

Tyler Palsulich commented on TIKA-1543:
---

(I just added the logging in r1668477 a few minutes ago. See [this 
commit|https://github.com/apache/tika/commit/84825f035069d572f155f86fa4c18d5a79b48028]
 on GitHub.)

> TesseractOCRParser.setTesseractPath() doesn't work on Linux
> ---
>
> Key: TIKA-1543
> URL: https://issues.apache.org/jira/browse/TIKA-1543
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Sean Zhao
> Fix For: 1.8
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> After calling setTesseractPath() to set the Tesseract path to a non-default 
> path, such as /root/tesseract, a call to TesseractOCRParser.parse() returns 
> nothing.
> Not sure if this is related to TIKA-1421.





[jira] [Closed] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1543.
-
Resolution: Fixed

This isn't actually a problem. I just tested locally -- it works.

We have unit tests for the path, but it's difficult to test that extraction 
works with a non-standard path, since we don't know what the path is...

I think the problem is one of the following:
* The path you set is not the directory that contains the executable.
* The path doesn't have a tessdata directory inside it.

You can see all of the Tesseract debugging messages by enabling {{debug}} level 
logging (put a 
[log4j.properties|https://github.com/apache/tika/blob/10298692cb27d1ad3732589930987e2fe2681ee8/tika-parsers/src/test/resources/log4j.properties]
 file on your classpath and set the output level to {{debug}}).

I'd be happy to help you debug further.
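
The two conditions named in this comment (the directory must contain the executable, and it must contain a tessdata directory) can be checked before the path is handed to setTesseractPath(). The following is only a minimal sketch in plain java.nio, not code from Tika. The class name TesseractPathCheck, the method problems(), and the executable file names "tesseract" and "tesseract.exe" are assumptions made for illustration. The check only looks for the files; it does not verify execute permission or run Tesseract.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class TesseractPathCheck {

    /** Returns the problems found in dir; an empty list means both checks passed. */
    static List<String> problems(Path dir) {
        List<String> found = new ArrayList<>();

        // Check 1: the directory itself must contain the executable.
        // The file names below are assumptions (Unix and Windows variants).
        boolean hasExecutable = Files.isRegularFile(dir.resolve("tesseract"))
                || Files.isRegularFile(dir.resolve("tesseract.exe"));
        if (!hasExecutable) {
            found.add("no tesseract executable in " + dir);
        }

        // Check 2: the directory must have a tessdata directory inside it.
        if (!Files.isDirectory(dir.resolve("tessdata"))) {
            found.add("no tessdata directory in " + dir);
        }
        return found;
    }

    public static void main(String[] args) {
        // /root/tesseract is the example path from the issue description.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/root/tesseract");
        List<String> found = problems(dir);
        if (found.isEmpty()) {
            System.out.println("Both checks passed for " + dir);
        } else {
            for (String problem : found) {
                System.out.println(problem);
            }
        }
    }
}
```

If both checks pass and extraction still returns nothing, the debug logging described above is the next thing to look at.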



[jira] [Updated] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1543:
--
Fix Version/s: (was: 1.7)
   1.8



[jira] [Resolved] (TIKA-1565) image/gif parse error

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1565.
---
   Resolution: Fixed
Fix Version/s: (was: 1.7)
   1.8
 Assignee: Tyler Palsulich

Marking as Fixed for 1.8. The file is now parsed without an Exception. Please 
reopen if you are still running into this issue with Trunk or 1.8 (when it is 
released some time in the future).

> image/gif parse error
> -
>
> Key: TIKA-1565
> URL: https://issues.apache.org/jira/browse/TIKA-1565
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
> Environment: win7 x64  jdk1.7
>Reporter: lixin
>Assignee: Tyler Palsulich
> Fix For: 1.8
>
> Attachments: JNK16-1309-173.mht
>
>
> I am getting an exception parsing the following mht File
> {code}
> org.apache.tika.exception.TikaException: image/gif parse error
>   at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
>   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>   at org.apache.tika.example.MyTest.test1(MyTest.java:31)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
>   at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
>   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
> Caused by: javax.imageio.IIOException: Unexpected block type 1!
>   at com.sun.imageio.plugins.gif.GIFImageReader.readMetadata(Unknown Source)
>   at com.sun.imageio.plugins.gif.GIFImageReader.getWidth(Unknown Source)
>   at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92)
>   ... 32 more
> {code}
> my test code:
> {code}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> parser.parse(new FileInputStream(new File(file)), handler, metadata, context);
> System.out.println(handler.toString());
> {code}





[jira] [Updated] (TIKA-1565) image/gif parse error

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1565:
--
Description: 
I am getting an exception parsing the following mht File
{code}
org.apache.tika.exception.TikaException: image/gif parse error
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at org.apache.tika.example.MyTest.test1(MyTest.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: javax.imageio.IIOException: Unexpected block type 1!
at com.sun.imageio.plugins.gif.GIFImageReader.readMetadata(Unknown Source)
at com.sun.imageio.plugins.gif.GIFImageReader.getWidth(Unknown Source)
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92)
... 32 more
{code}
my test code:
{code}
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(new FileInputStream(new File(file)), handler, metadata, context);
System.out.println(handler.toString());
{code}

  was:
I am getting an exception parsing the following mht File

org.apache.tika.exception.TikaException: image/gif parse error
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at 
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at 
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at org.apache.tika.example.MyTest.test1(MyTest.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at

Re: TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors

2015-03-22 Thread Mattmann, Chris A (3980)
Agreed Seb, moving dev@nutch.a.o into BCC and moving this to
the Tika list.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Sebastian Nagel 
Reply-To: "d...@nutch.apache.org" 
Date: Sunday, March 22, 2015 at 4:32 AM
To: "d...@nutch.apache.org" 
Subject: Re: TestGDALParser.testParseBasicInfo and
TestGDALParser.testParseMetadata errors

>Hi,
>
>maybe this thread is better at dev@tika
>since it's about building Tika.
>
>Btw., I can successfully build Tika trunk/1.8.
>Looks like something system-specific, similar to TIKA-1503:
>gdalinfo is installed, but fails to parse a certain file format.
>
>Thanks,
>Sebastian
>
>On 03/22/2015 08:26 AM, Mohit Bagde wrote:
>> Hi,
>> 
>> I am also getting a similar error. Is this issue caused by a prior
>> installation of gdal or tesseract? I was using Tika 1.7, but the 1.8
>> build did not work when I tried it. I did a clean svn checkout and then
>> built it, but encountered errors similar to those above.
>> 
>> Is there a patch for this? Or has anyone found a fix for this?
>> 
>> On Mar 21, 2015 10:24 PM, "Anvesha Sinha" wrote:
>> 
>> Hi everyone,
>> 
>> While installing TIKA, I am getting the following error:
>> 
>> Tests run: 3, Failures: 2, Errors: 0, Skipped: 1, Time elapsed: 0.209 sec <<< FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
>> testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.118 sec  <<< FAILURE!
>> java.lang.AssertionError: null
>> at org.junit.Assert.fail(Assert.java:86)
>> at org.junit.Assert.assertTrue(Assert.java:41)
>> at org.junit.Assert.assertNotNull(Assert.java:621)
>> at org.junit.Assert.assertNotNull(Assert.java:631)
>> at org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(TestGDALParser.java:70)
>> testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.062 sec  <<< FAILURE!
>> java.lang.AssertionError: null
>> at org.junit.Assert.fail(Assert.java:86)
>> at org.junit.Assert.assertTrue(Assert.java:41)
>> at org.junit.Assert.assertNotNull(Assert.java:621)
>> at org.junit.Assert.assertNotNull(Assert.java:631)
>> at org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(TestGDALParser.java:111)
>> 
>> Just to clarify, this error is not the same as
>> 
>> testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.206 sec  <<< FAILURE!
>> java.lang.AssertionError
>> at org.junit.Assert.fail(Assert.java:86)
>> at org.junit.Assert.assertTrue(Assert.java:41)
>> at org.junit.Assert.assertNotNull(Assert.java:621)
>> at org.junit.Assert.assertNotNull(Assert.java:631)
>> at org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(TestGDALParser.java:153)
>> which was rectified by tpalsulich in Revision 1647742. Any guidance/help would be appreciated.
>> 
>> Thanks,
>> Anvesha
>> -- 
>> Graduate Student (MS in Computer Science)
>> University of Southern California
>> /Phone: (+1) 213-308-9002 /
>> 
>



[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2015-03-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14374986#comment-14374986
 ] 

Nick Burch commented on TIKA-1344:
--

This might be a good one to add as an example, based on the recursing example 
we have, but showing how to marry the content handler changes with parser 
resources.

> Ability to generate self-contained HTML with images
> ---
>
> Key: TIKA-1344
> URL: https://issues.apache.org/jira/browse/TIKA-1344
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Andrew Skiba
>  Labels: easyfix, patch
> Attachments: word.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the current code, the images from Word documents are referenced by 
> "embedded:xxx" links in the generated HTML. This causes browsers to display 
> an "x" icon instead of the image.
> The proposed patch encodes the images using Data URIs if the 
> -Dtika.parsers.urlimages system property is set. 
> http://en.wikipedia.org/wiki/Data_URI_scheme
> So the default behavior is the same, but users of the library can optionally 
> generate self-contained HTML with correct images.
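
For readers unfamiliar with the Data URI scheme referenced in the description, here is a rough sketch of the encoding step: the image bytes are Base64-encoded and placed directly in the src attribute, so the HTML needs no external file. This is not the code from word.patch. The class name DataUriSketch and both method names are made up for illustration, and the sketch assumes Java 8 or later for java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration, not taken from the attached patch.
final class DataUriSketch {

    /** Builds a data: URI of the form data:<mime type>;base64,<encoded bytes>. */
    static String toDataUri(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    /** Wraps the data URI in an img element, replacing an "embedded:xxx" style link. */
    static String imgTag(String mimeType, byte[] data) {
        return "<img src=\"" + toDataUri(mimeType, data) + "\"/>";
    }

    public static void main(String[] args) {
        // Placeholder bytes stand in for real image content.
        byte[] placeholder = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(imgTag("image/png", placeholder));
    }
}
```

Base64 grows the payload by roughly a third, which is one reason to keep this behavior optional, as the description proposes.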





[jira] [Commented] (TIKA-1325) Move the font metadata definitions to properties

2015-03-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14374985#comment-14374985
 ] 

Nick Burch commented on TIKA-1325:
--

I think we either need to find an external standard, or invent our own. 
Currently I think we're still on plain strings, when we really want properties 
(external or ours).

> Move the font metadata definitions to properties
> 
>
> Key: TIKA-1325
> URL: https://issues.apache.org/jira/browse/TIKA-1325
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 1.5, 1.6
>Reporter: Nick Burch
> Attachments: TIKA-1325_TimeZone.patch
>
>
> As noticed while working on TIKA-1182, the AFM font parser has a bunch of 
> hard-coded strings it uses as metadata keys, while the TTF font parser 
> doesn't have many.
> We should switch these to being proper Properties, with definitions from a 
> well-known standard (+ compatibility fallbacks), and have both use largely 
> the same set.


