[jira] [Created] (TIKA-1291) Invalid JSON output on CLI
Steffen created TIKA-1291:

Summary: Invalid JSON output on CLI
Key: TIKA-1291
URL: https://issues.apache.org/jira/browse/TIKA-1291
Project: Tika
Issue Type: Bug
Components: cli, metadata
Affects Versions: 1.5, 1.4
Reporter: Steffen

Getting the metadata from Tika via the CLI with the output format set to JSON sometimes gives invalid JSON. I only found float/array errors here in JIRA and thus created this ticket with a new case. In my case the file that led to invalid JSON output was a PNG file (that I unfortunately can't provide for testing):

{noformat}
{
Application Record Version:4,
Component 1:Y component: Quantization table 0, Sampling factors 2 horiz/2 vert,
Component 2:Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert,
Component 3:Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert,
Compression Type:Baseline,
Content-Length:113081,
Content-Type:image/jpeg,
Data Precision:8 bits,
IPTC-NAA record:24 bytes binary data,
Image Height:479 pixels,
Image Width:671 pixels,
Number of Components:3,
Resolution Units:inch,
Unknown tag (0x02f0):35,0,556,479,
X Resolution:220 dots,
Y Resolution:220 dots,
resourceName:18,
tiff:BitsPerSample:8,
tiff:ImageLength:479,
tiff:ImageWidth:671
}
{noformat}

The
{noformat}
Unknown tag (0x02f0):35,0,556,479,
{noformat}
is invalid JSON.

It would be nice if Tika always produced valid JSON output. For other cases that might not be caught by the fixes for this ticket, it would be nice to have a CLI argument/option that disables the output of certain (unknown?) fields or accepts a whitelist of field names to output. That way users can bridge the time until new Tika releases by being more specific on the shell. If that feature already exists, I apologize for not having found it, and a hint to the CLI option would be appreciated.

-- This message was sent by Atlassian JIRA (v6.2#6252)
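To illustrate the two requests in this report, here is a minimal sketch in plain Java of (a) emitting every metadata key and value as a quoted, escaped JSON string, so that a value such as 35,0,556,479 can no longer appear as a bare token, and (b) an optional whitelist of field names. This is not Tika's actual serializer, and the whitelist is not an existing CLI option: the class MetadataJsonSketch and its methods quote and toJson are hypothetical names used only for this example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika: serializes a flat metadata map to JSON.
final class MetadataJsonSketch {

    // Wrap a string in double quotes and escape the characters JSON forbids
    // inside a string: the quote, the backslash, and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Emit all entries as string-valued members. If whitelist is non-null,
    // only keys contained in it are written (assumed feature, see lead-in).
    static String toJson(Map<String, String> metadata, Set<String> whitelist) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (whitelist != null && !whitelist.contains(e.getKey())) {
                continue;
            }
            if (!first) {
                sb.append(',');
            }
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Unknown tag (0x02f0)", "35,0,556,479");
        metadata.put("Image Width", "671 pixels");
        // All fields, every value quoted.
        System.out.println(toJson(metadata, null));
        // Only the whitelisted field.
        System.out.println(toJson(metadata, Collections.singleton("Image Width")));
    }
}
```

Treating every value as a string loses the distinction between numbers and text, but it guarantees syntactically valid output regardless of what a parser puts into a field; a real fix might instead emit a number only when the whole value parses as one.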
[jira] [Commented] (TIKA-1287) Update NetCDF .jar file on Maven Central
[ https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990649#comment-13990649 ]

Lewis John McGibbney commented on TIKA-1287:

This is not so much a bug as an improvement via an upgrade to one dependency. Can you not use the 4.2.20 dependency published in Nov 2011? Even if it is larger, you can prune transitive dependencies.

Update NetCDF .jar file on Maven Central

Key: TIKA-1287
URL: https://issues.apache.org/jira/browse/TIKA-1287
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Ann Burgess
Labels: jar, maven, netcdf, tika, unit-test, update

I am working to update the NetCDFParser file. When using the most recent .jar file available from http://www.unidata.ucar.edu/ at the command line, I receive a note about a deprecated API:

javac -classpath ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar org/apache/tika/parser/netcdf/NetCDFParser.java
Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

After updating the NetCDFParser file to use non-deprecated methods (e.g. changing dimension.getName() to dimension.getFullName()), however, I get failed unit tests in Maven, which I assume is because the Maven Central repository has an outdated version of the .jar file needed for NetCDF files (http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22). Can anyone provide insight into how I get the updated .jar file into the Maven Central Repository? Is there an alternative method to update Tika so I can run my unit tests in Maven?
[jira] [Updated] (TIKA-1287) Update NetCDF .jar file on Maven Central
[ https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-1287:

Issue Type: Improvement (was: Bug)

Update NetCDF .jar file on Maven Central

Key: TIKA-1287
URL: https://issues.apache.org/jira/browse/TIKA-1287
Project: Tika
Issue Type: Improvement
Affects Versions: 1.5
Reporter: Ann Burgess
Labels: jar, maven, netcdf, tika, unit-test, update

I am working to update the NetCDFParser file. When using the most recent .jar file available from http://www.unidata.ucar.edu/ at the command line, I receive a note about a deprecated API:

javac -classpath ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar org/apache/tika/parser/netcdf/NetCDFParser.java
Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

After updating the NetCDFParser file to use non-deprecated methods (e.g. changing dimension.getName() to dimension.getFullName()), however, I get failed unit tests in Maven, which I assume is because the Maven Central repository has an outdated version of the .jar file needed for NetCDF files (http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22). Can anyone provide insight into how I get the updated .jar file into the Maven Central Repository? Is there an alternative method to update Tika so I can run my unit tests in Maven?
[jira] [Updated] (TIKA-1290) Upgrade to PDFBOX 1.8.5
[ https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1290:

Labels: trivial (was: )

Upgrade to PDFBOX 1.8.5

Key: TIKA-1290
URL: https://issues.apache.org/jira/browse/TIKA-1290
Project: Tika
Issue Type: Improvement
Reporter: Hong-Thai Nguyen
Labels: trivial

PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent
We can update to this version and possibly also test whether it fixes TIKA-1231.
[jira] [Resolved] (TIKA-1290) Upgrade to PDFBOX 1.8.5
[ https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1290.

Resolution: Fixed

r1592780

Upgrade to PDFBOX 1.8.5

Key: TIKA-1290
URL: https://issues.apache.org/jira/browse/TIKA-1290
Project: Tika
Issue Type: Improvement
Reporter: Hong-Thai Nguyen
Labels: trivial

PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent
We can update to this version and possibly also test whether it fixes TIKA-1231.
[jira] [Commented] (TIKA-1287) Update NetCDF .jar file on Maven Central
[ https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991117#comment-13991117 ]

Ann Burgess commented on TIKA-1287:

I have modified my updates to the NetCDFParser to work with the current version of NetCDF on Maven Central. Once the new NetCDF version is on Maven Central, I will update the code accordingly. I have created a NetCDFParserPatch.patch file for review, but will start a new JIRA issue for that.

Update NetCDF .jar file on Maven Central

Key: TIKA-1287
URL: https://issues.apache.org/jira/browse/TIKA-1287
Project: Tika
Issue Type: Improvement
Affects Versions: 1.5
Reporter: Ann Burgess
Labels: jar, maven, netcdf, tika, unit-test, update

I am working to update the NetCDFParser file. When using the most recent .jar file available from http://www.unidata.ucar.edu/ at the command line, I receive a note about a deprecated API:

javac -classpath ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar org/apache/tika/parser/netcdf/NetCDFParser.java
Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

After updating the NetCDFParser file to use non-deprecated methods (e.g. changing dimension.getName() to dimension.getFullName()), however, I get failed unit tests in Maven, which I assume is because the Maven Central repository has an outdated version of the .jar file needed for NetCDF files (http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22). Can anyone provide insight into how I get the updated .jar file into the Maven Central Repository? Is there an alternative method to update Tika so I can run my unit tests in Maven?
[jira] [Updated] (TIKA-1265) Text parsing support for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ann Burgess updated TIKA-1265:

Attachment: NetCDFParserPatch.patch

Text parsing support for NetCDF

Key: TIKA-1265
URL: https://issues.apache.org/jira/browse/TIKA-1265
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ann Burgess
Labels: patch
Attachments: NetCDFParserPatch.patch
Original Estimate: 672h
Remaining Estimate: 672h

Currently Tika extracts --metadata information from NetCDF files. We are working on a patch that will enable --text extraction, thus providing the 'Dimension' and 'Variable' information.
[jira] [Updated] (TIKA-1265) [patch] Text output for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ann Burgess updated TIKA-1265:

Summary: [patch] Text output for NetCDF (was: Text parsing support for NetCDF)

[patch] Text output for NetCDF

Key: TIKA-1265
URL: https://issues.apache.org/jira/browse/TIKA-1265
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ann Burgess
Labels: patch
Attachments: NetCDFParserPatch.patch
Original Estimate: 672h
Remaining Estimate: 672h

Currently Tika extracts --metadata information from NetCDF files. We are working on a patch that will enable --text extraction, thus providing the 'Dimension' and 'Variable' information.
[jira] [Commented] (TIKA-1265) [patch] Text output for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991162#comment-13991162 ]

Ann Burgess commented on TIKA-1265:

This patch updates the NetCDFParser to provide 'Dimension' and 'Variable' information as --text output for NetCDF files. Additionally, the patch updates NetCDFParserTest to test the new text output. To test the new parser and create the patch, I followed the steps at https://wiki.apache.org/nutch/HowToContribute. Please let me know if I've missed any steps in the process of getting this committed. The .patch file is attached.

[patch] Text output for NetCDF

Key: TIKA-1265
URL: https://issues.apache.org/jira/browse/TIKA-1265
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ann Burgess
Labels: patch
Attachments: NetCDFParserPatch.patch
Original Estimate: 672h
Remaining Estimate: 672h

Currently Tika extracts --metadata information from NetCDF files. We are working on a patch that will enable --text extraction, thus providing the 'Dimension' and 'Variable' information.
[jira] [Assigned] (TIKA-1265) [patch] Text output for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-1265:

Assignee: Chris A. Mattmann

[patch] Text output for NetCDF

Key: TIKA-1265
URL: https://issues.apache.org/jira/browse/TIKA-1265
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ann Burgess
Assignee: Chris A. Mattmann
Labels: patch
Attachments: NetCDFParserPatch.patch
Original Estimate: 672h
Remaining Estimate: 672h

Currently Tika extracts --metadata information from NetCDF files. We are working on a patch that will enable --text extraction, thus providing the 'Dimension' and 'Variable' information.
[jira] [Commented] (TIKA-1265) [patch] Text output for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991228#comment-13991228 ]

Chris A. Mattmann commented on TIKA-1265:

Tested the patch out, Annie, with success!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn co https://svn.apache.org/repos/asf/tika/trunk tika
[chipotle:~/tmp/tika] mattmann% curl -O https://issues.apache.org/jira/secure/attachment/12643631/NetCDFParserPatch.patch
[chipotle:~/tmp/tika] mattmann% cd tika/tika-parsers
[chipotle:~/tmp/tika] mattmann% patch -p0 < ../NetCDFParserPatch.patch
...long process
[INFO] Apache Tika parent ................ SUCCESS [2.272s]
[INFO] Apache Tika core .................. SUCCESS [22.796s]
[INFO] Apache Tika parsers ............... SUCCESS [1:36.476s]
[INFO] Apache Tika XMP ................... SUCCESS [5.774s]
[INFO] Apache Tika application ........... SUCCESS [25.599s]
[INFO] Apache Tika OSGi bundle ........... SUCCESS [32.496s]
[INFO] Apache Tika server ................ SUCCESS [43.152s]
[INFO] Apache Tika Java-7 Components ..... SUCCESS [5.569s]
[INFO] Apache Tika ....................... SUCCESS [0.114s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3:55.399s
[INFO] Finished at: Tue May 06 16:10:14 MDT 2014
[INFO] Final Memory: 70M/176M
[INFO] ------------------------------------------------------------------------
{noformat}

I will commit this shortly.

[patch] Text output for NetCDF

Key: TIKA-1265
URL: https://issues.apache.org/jira/browse/TIKA-1265
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ann Burgess
Assignee: Chris A. Mattmann
Labels: patch
Attachments: NetCDFParserPatch.patch
Original Estimate: 672h
Remaining Estimate: 672h

Currently Tika extracts --metadata information from NetCDF files. We are working on a patch that will enable --text extraction, thus providing the 'Dimension' and 'Variable' information.
[jira] [Commented] (TIKA-1265) [patch] Text output for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991234#comment-13991234 ]

Chris A. Mattmann commented on TIKA-1265:

Tested on example NetCDF, acceptance test passed:

{noformat}
[chipotle:tika/tika-app/target] mattmann% java -jar tika-app-1.6-SNAPSHOT.jar /Users/mattmann/tmp/tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Conventions" content="CF-1.0"/>
<meta name="acknowledgment" content=" Any use of CCSM data should acknowledge the contribution&#10; of the CCSM project and CCSM sponsor agencies with the &#10; following citation:&#10; 'This research uses data provided by the Community Climate&#10; System Model project (www.ccsm.ucar.edu), supported by the&#10; Directorate for Geosciences of the National Science Foundation&#10; and the Office of Biological and Environmental Research of&#10; the U.S. Department of Energy.'&#10;In addition, the words 'Community Climate System Model' and&#10; 'CCSM' should be included as metadata for webpages referencing&#10; work using CCSM data or as keywords provided to journal or book&#10;publishers of your manuscripts.&#10;Users of CCSM data accept the responsibility of emailing&#10; citations of publications of research using CCSM data to&#10; c...@ucar.edu.&#10;Any redistribution of CCSM data must include this data&#10; acknowledgement statement."/>
<meta name="Content-Length" content="2767916"/>
<meta name="experiment_id" content="720 ppm stabilization experiment (SRESA1B)"/>
<meta name="table_id" content="Table A1"/>
<meta name="cmd_ln" content="bds -x 256 -y 128 -m 23 -o /data/zender/data/dst_T85.nc"/>
<meta name="contact" content="c...@ucar.edu"/>
<meta name="creation_date" content=""/>
<meta name="history" content="Tue Oct 25 15:08:51 2005: ncks -O -x -v va -m sresa1b_ncar_ccsm3_0_run1_21.nc sresa1b_ncar_ccsm3_0_run1_21.nc&#10;Tue Oct 25 15:07:21 2005: ncks -d time,0 sresa1b_ncar_ccsm3_0_run1_21_201912.nc sresa1b_ncar_ccsm3_0_run1_21.nc&#10;Tue Oct 25 13:29:43 2005: ncks -d time,0,239 sresa1b_ncar_ccsm3_0_run1_21_209912.nc /var/www/html/tmp/sresa1b_ncar_ccsm3_0_run1_21_201912.nc&#10;Thu Oct 20 10:47:50 2005: ncks -A -v va /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/sresa1b_ncar_ccsm3_0_run1_va_21_209912.nc /data/brownmc/sresa1b/atm/mo/tas/ncar_ccsm3_0/run1/sresa1b_ncar_ccsm3_0_run1_21_209912.nc&#10;Wed Oct 19 14:55:04 2005: ncks -F -d time,01,1200 /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/sresa1b_ncar_ccsm3_0_run1_va_21_209912.nc /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/sresa1b_ncar_ccsm3_0_run1_va_21_209912.nc&#10;Wed Oct 19 14:53:28 2005: ncrcat /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/foo_05_1200.nc /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/foo_1192_1196.nc /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/sresa1b_ncar_ccsm3_0_run1_va_21_209912.nc&#10;Wed Oct 19 14:50:38 2005: ncks -F -d time,05,1200 /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/va_A1.SRESA1B_1.CCSM.atmm.2000-01_cat_2099-12.nc /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/foo_05_1200.nc&#10;Wed Oct 19 14:49:45 2005: ncrcat /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/va_A1.SRESA1B_1.CCSM.atmm.2000-01_cat_2079-12.nc /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/va_A1.SRESA1B_1.CCSM.atmm.2080-01_cat_2099-12.nc /data/brownmc/sresa1b/atm/mo/va/ncar_ccsm3_0/run1/va_A1.SRESA1B_1.CCSM.atmm.2000-01_cat_2099-12.nc&#10;Created from CCSM3 case b30.040a&#10; by wgstr...@ucar.edu&#10; on Wed Nov 17 14:12:57 EST 2004&#10; &#10; For all data, added IPCC requested metadata"/>
<meta name="references" content="Collins, W.D., et al., 2005:&#10; The Community Climate System Model, Version 3&#10; Journal of Climate&#10; &#10; Main website: http://www.ccsm.ucar.edu"/>
<meta name="source" content="CCSM3.0, version beta19 (2004): &#10;atmosphere: CAM3.0, T85L26;&#10;ocean : POP1.4.3 (modified), gx1v3&#10;sea ice : CSIM5.0, T85;&#10;land : CLM3.0, gx1v3"/>
<meta name="model_name_english" content="NCAR CCSM"/>
<meta name="project_id" content="IPCC Fourth Assessment"/>
<meta name="prg_ID" content="Source file unknown Version unknown Date unknown"/>
<meta name="realization" content="1"/>
<meta name="comment" content="This simulation was initiated from year 2000 of &#10; CCSM3 model run b30.030a and executed on &#10; hardware cheetah.ccs.ornl.gov. The input external forcings are&#10;ozone forcing: A1B.ozone.128x64_L18_1991-2100_c040528.nc&#10;aerosol optics : AerosolOptics_c040105.nc&#10;aerosol MMR : AerosolMass_V_128x256_clim_c031022.nc&#10;carbon scaling : carbonscaling_A1B_1990-2100_c040609.nc&#10;solar forcing: Fixed at 1366.5 W m-2&#10;GHGs : ghg_ipcc_A1B_1870-2100_c040521.nc&#10;GHG loss rates : noaamisc.r8.nc&#10;volcanic forcing : none&#10;DMS emissions:
[jira] [Resolved] (TIKA-1265) [patch] Text output for NetCDF
[ https://issues.apache.org/jira/browse/TIKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1265.

Resolution: Fixed
Fix Version/s: 1.6

- patch committed in r1592912 and in r1592913. Thank you, Annie!

[patch] Text output for NetCDF

Key: TIKA-1265
URL: https://issues.apache.org/jira/browse/TIKA-1265
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ann Burgess
Assignee: Chris A. Mattmann
Labels: patch
Fix For: 1.6
Attachments: NetCDFParserPatch.patch
Original Estimate: 672h
Remaining Estimate: 672h

Currently Tika extracts --metadata information from NetCDF files. We are working on a patch that will enable --text extraction, thus providing the 'Dimension' and 'Variable' information.