[ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209812#comment-13209812 ]
Richard Yu commented on TIKA-862:
---------------------------------

The file I sent earlier does not pass the h5dump test. It also does not pass the Tika test (i.e. Tika showed just 4 lines of output). I deleted that file from my test samples; here are the ones I kept:

[ryu@localhost hdf5]$ ls
IICMO_npp_d20120119_t1301328_e1302569_b01180_c20120119195316463240_noaa_ops.h5
RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5
SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
VSTYO_npp_d20120120_t0617066_e0618308_b01190_c20120120123536501739_noaa_ops.hdf5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m IICMO_npp_d20120119_t1301328_e1302569_b01180_c20120119195316463240_noaa_ops.h5
Content-Encoding: windows-1252
Content-Length: 14800864
Content-Type: text/plain
resourceName: IICMO_npp_d20120119_t1301328_e1302569_b01180_c20120119195316463240_noaa_ops.h5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5
Content-Encoding: windows-1252
Content-Length: 20888
Content-Type: text/plain
resourceName: RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
Content-Encoding: windows-1252
Content-Length: 22187952
Content-Type: text/plain
resourceName: SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m VSTYO_npp_d20120120_t0617066_e0618308_b01190_c20120120123536501739_noaa_ops.hdf5
Content-Encoding: windows-1252
Content-Length: 12328128
Content-Type: text/plain
resourceName: VSTYO_npp_d20120120_t0617066_e0618308_b01190_c20120120123536501739_noaa_ops.hdf5

All of them work with h5dump, yet Tika reports every one as text/plain. All of them are huge files except RNSCA....
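
One way to cross-check results like these independently of Tika and h5dump might be to look for the HDF5 format signature directly. The sketch below is only an illustration and is not taken from Tika: the class name `Hdf5SignatureCheck` and the method `findSignature` are hypothetical. It relies on the HDF5 file format rule that the 8-byte signature sits at byte offset 0, 512, 1024, 2048, and so on (doubling each time). Whether a non-zero offset has anything to do with the detection problem in this issue is an assumption that this thread does not confirm.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: reports where the HDF5 signature is.
final class Hdf5SignatureCheck {

    // 8-byte HDF5 format signature: \211 H D F \r \n \032 \n
    private static final byte[] SIGNATURE = {
        (byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'
    };

    /** Returns the byte offset of the HDF5 signature, or -1 if none is found. */
    static long findSignature(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            byte[] buffer = new byte[SIGNATURE.length];
            // Candidate offsets are 0, 512, 1024, 2048, ... (doubling).
            for (long offset = 0;
                 offset + SIGNATURE.length <= length;
                 offset = (offset == 0) ? 512 : offset * 2) {
                file.seek(offset);
                file.readFully(buffer);
                if (Arrays.equals(buffer, SIGNATURE)) {
                    return offset;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Print one line per file given on the command line.
        for (String path : args) {
            System.out.println(path + ": signature offset " + findSignature(path));
        }
    }
}
```

After compiling with javac, it could be run as `java Hdf5SignatureCheck *.h5`. A result of -1 would suggest the file is not valid HDF5 (consistent with an h5dump failure). A non-zero offset would mean the file starts with a user block, so a detector that only inspects the first bytes would not see the signature.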

I will download some smaller files and test them against Tika/h5dump. I am not sure whether this information helps you. Let me know. Thanks!

Richard

> JPSS HDF5 files not being detected appropriately
> ------------------------------------------------
>
>                 Key: TIKA-862
>                 URL: https://issues.apache.org/jira/browse/TIKA-862
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Richard Yu
>            Assignee: Chris A. Mattmann
>         Attachments: RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5,
>                      RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5,
>                      RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5
>
>
> As commented in TIKA-614, JPSS HDF5 files are not being properly detected by Tika. See this:
> from [~minfing]:
> {quote}
> We were trying to extract metadata from our h5 file (i.e. with JPSS extension). We ran the following command line:
> {noformat}
> [ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m \
>   /usr/local/staging/products/h5/SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
> Content-Encoding: windows-1252
> Content-Length: 22187952
> Content-Type: text/plain
> resourceName: SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
> [ryu@localhost hdf5extractor]$
> {noformat}
> We noticed that the content type is text/plain and that there are only 4 lines of output (i.e. we expected a lot of metadata).
> Let me know if more information is needed. Thanks!
> Richard
> {quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira