RE: [memex-jpl] this week action from luke
Hi professor,

I just tried it with minLength set to 1024, and I get "text/plain". I am a bit surprised, by the way: a 6000 min length still gives "application/xhtml+xml", while anything at or below 1024 gives "text/plain". :)

BTW, the min length I am referring to (and altering) is the following, in MimeTypes.java:

public int getMinLength() {
    // This needs to be reasonably large to be able to correctly detect
    // things like XML root elements after initial comment and DTDs
    return 64 * 1024;
}

Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
Sent: Tuesday, April 21, 2015 7:48 PM
To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; dev@tika.apache.org
Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke. So I guess all I was asking was: could you try it out? Thanks for the lesson in the RFC.

Cheers,
Chris

Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Luke
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann, Chris Mattmann, "'Totaro, Giuseppe U (3980-Affiliate)'"
Cc: "'Bryant, Ann C (398G-Affiliate)'", "'Zimdars, Paul A (3980-Affiliate)'", NSF Polar CyberInfrastructure DR Students
Subject: RE: [memex-jpl] this week action from luke

>Hi professor,
>
>I think it highly depends on the content being read by Tika. For
>example, if there is a sequence of bytes in the file being read that
>matches one or more of the MIME types defined in tika-mimetypes.xml,
>I expect Tika to put those types in its candidate list; note that the
>magic-byte detection approach can produce multiple candidate MIME
>types.
>Tika also considers the decision made by the extension-detection
>approach: if the type the extension suggests is the first one in the
>magic candidate list, then that first one will certainly be returned
>(the same applies to the metadata-hint approach). Of course, Tika also
>prefers the type that is the most specialized.
>
>Let's get back to the following question; here is my guess though.
>[Prof]: Also what happens if you tweak the definition of XHTML to not
>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>Let's consider an extreme case where we only scan 10 bytes, or even 1:
>then it seems the magic bytes will inevitably detect nothing, and I
>think Tika will return something like "application/octet-stream", the
>most general type. As mentioned, Tika favours the most specialized
>type, and in this extreme case almost every type is a subclass of
>"application/octet-stream", so if the extension approach returns
>something more specialized, the answer may be yes: I think it is very
>possible that the CBOR type detected by the extension approach takes
>over in this case...
>
>My idea was, and still is, that if the CBOR self-describe tag 55799 is
>present in the CBOR file, then that can be used to detect the CBOR
>type. Again, the CBOR type will probably be appended to the magic
>candidate list together with another one such as text/html; I guess
>the order in the list probably also matters, with the first one
>preferred over the next. The decision from the extension-detection
>approach also plays a role in breaking the tie, e.g. if the extension
>method agrees on CBOR with one of the candidate types in the magic
>list, then CBOR will be returned (again, the same applies to the
>metadata-hint method).
>
>I have not taken a closer look at a CBOR file that has the tag 55799,
>but I expect its hex to be something like 0xd9d9f7, or the tag should
>be present in the header as a fixed sequence of bytes
>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is
>present in the file, or preferably in the header (within a reasonable
>range of bytes), I believe it can probably be used as the magic
>numbers for the CBOR type.
>
>There is another thing I mentioned in the JIRA ticket I opened
>yesterday against the CBOR parser and detection: it is also possible
>that CBOR content can be embedded inside a plain JSON file, and the
>way a decoder can distinguish the two in that file is by looking at
>the tag 55799 again. This may rarely happen, but a robust parser might
>be able to take care of it; Tika might want to consider using the
>FasterXML library that the Nutch tool uses when developing the CBOR
>parser...
>Again let me cite the same paragraph from the RFC:
>
>"a decoder might be able to parse both CBOR and JSON.
> Such a decoder would need to mechanically distinguish the two
> formats. An easy way for an encoder to help the decoder would be to
> tag the entire CBOR item with tag 55799, the serialization of which
> will never be found at the beginning of a JSON text."
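The header check Luke proposes can be sketched in a few lines of plain Java. This is an illustrative standalone snippet under the assumption stated in RFC 7049 section 2.4.5 that tag 55799 serializes as the three bytes 0xD9 0xD9 0xF7; the class and method names are invented for this sketch and are not Tika's API:

```java
import java.util.Arrays;

/** Minimal sketch: detect the CBOR self-describe tag 55799 (RFC 7049, 2.4.5). */
public class CborMagic {
    // Tag 55799 serializes as the three bytes 0xD9 0xD9 0xF7.
    private static final byte[] SELF_DESCRIBE = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the buffer begins with the self-describe tag. */
    public static boolean hasSelfDescribeTag(byte[] header) {
        return header.length >= 3
                && Arrays.equals(Arrays.copyOf(header, 3), SELF_DESCRIBE);
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA1};
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(hasSelfDescribeTag(tagged)); // true
        System.out.println(hasSelfDescribeTag(json));   // false
    }
}
```

As the RFC quote notes, this three-byte prefix can never begin a valid JSON text, which is what makes it usable as a magic number.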
RE: [memex-jpl] this week action from luke
Hi professor,

I think it highly depends on the content being read by Tika. For example, if there is a sequence of bytes in the file being read that matches one or more of the MIME types defined in tika-mimetypes.xml, I expect Tika to put those types in its candidate list; note that the magic-byte detection approach can produce multiple candidate MIME types. Tika also considers the decision made by the extension-detection approach: if the type the extension suggests is the first one in the magic candidate list, then that first one will certainly be returned (the same applies to the metadata-hint approach). Of course, Tika also prefers the type that is the most specialized.

Let's get back to the following question; here is my guess though.

[Prof]: Also what happens if you tweak the definition of XHTML to not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?

Let's consider an extreme case where we only scan 10 bytes, or even 1: then it seems the magic bytes will inevitably detect nothing, and I think Tika will return something like "application/octet-stream", the most general type. As mentioned, Tika favours the most specialized type, and in this extreme case almost every type is a subclass of "application/octet-stream", so if the extension approach returns something more specialized, the answer may be yes: I think it is very possible that the CBOR type detected by the extension approach takes over in this case...

My idea was, and still is, that if the CBOR self-describe tag 55799 is present in the CBOR file, then that can be used to detect the CBOR type. Again, the CBOR type will probably be appended to the magic candidate list together with another one such as text/html; I guess the order in the list probably also matters, with the first one preferred over the next. The decision from the extension-detection approach also plays a role in breaking the tie, e.g. if the extension method agrees on CBOR with one of the candidate types in the magic list, then CBOR will be returned (again, the same applies to the metadata-hint method).

I have not taken a closer look at a CBOR file that has the tag 55799, but I expect its hex to be something like 0xd9d9f7, or the tag should be present in the header as a fixed sequence of bytes (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is present in the file, or preferably in the header (within a reasonable range of bytes), I believe it can probably be used as the magic numbers for the CBOR type.

There is another thing I mentioned in the JIRA ticket I opened yesterday against the CBOR parser and detection: it is also possible that CBOR content can be embedded inside a plain JSON file, and the way a decoder can distinguish the two in that file is by looking at the tag 55799 again. This may rarely happen, but a robust parser might be able to take care of it; Tika might want to consider using the FasterXML library that the Nutch tool uses when developing the CBOR parser... Again let me cite the same paragraph from the RFC:

"a decoder might be able to parse both CBOR and JSON.
 Such a decoder would need to mechanically distinguish the two
 formats. An easy way for an encoder to help the decoder would be to
 tag the entire CBOR item with tag 55799, the serialization of which
 will never be found at the beginning of a JSON text."

Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there? Also, what happens if you tweak the definition of XHTML to not scan until 8192, but say 6000 (e.g., 0:6000)? Does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Luke
Date: Wednesday, April 22, 2015 at 12:19 AM
To: Chris Mattmann, "Totaro, Giuseppe U (3980-Affiliate)", Chris Mattmann
Cc: "Bryant, Ann C (398G-Affiliate)", "Zimdars, Paul A (3980-Affiliate)", NSF Polar CyberInfrastructure DR Students
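The precedence Luke describes (an ordered list of magic candidates, with the extension-based guess breaking ties and "application/octet-stream" as the most general fallback) can be sketched as a simplified model. This is illustrative only and is not Tika's actual detector code; the class and method names are invented:

```java
import java.util.List;

/** Simplified model of the tie-breaking described in the thread: magic
 *  detection yields an ordered candidate list (most specialized first);
 *  if the extension-based guess matches one of the candidates it wins,
 *  otherwise the first candidate is returned. Illustrative sketch only. */
public class DetectionModel {
    public static String pick(List<String> magicCandidates, String extensionGuess) {
        if (magicCandidates.isEmpty()) {
            // Nothing matched by magic: fall back to the extension guess,
            // or to the most general type if there is no extension hint.
            return extensionGuess != null ? extensionGuess : "application/octet-stream";
        }
        if (extensionGuess != null && magicCandidates.contains(extensionGuess)) {
            return extensionGuess; // extension agrees with a candidate: it breaks the tie
        }
        return magicCandidates.get(0); // otherwise: first, most specialized candidate
    }

    public static void main(String[] args) {
        // Magic window large enough to see the XHTML root element: XHTML wins.
        System.out.println(pick(List.of("application/xhtml+xml", "text/html"),
                "application/cbor")); // application/xhtml+xml
        // Magic also proposed CBOR and the .cbor extension agrees: CBOR wins.
        System.out.println(pick(List.of("application/cbor", "text/html"),
                "application/cbor")); // application/cbor
    }
}
```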
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506414#comment-14506414 ]

Hudson commented on TIKA-1610:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #640 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/640/])
WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by LukeLiush this closes #42 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675250)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

> CBOR Parser and detection [improvement]
> ---------------------------------------
>
>                 Key: TIKA-1610
>                 URL: https://issues.apache.org/jira/browse/TIKA-1610
>             Project: Tika
>          Issue Type: New Feature
>          Components: detector, mime, parser
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>              Labels: memex
>         Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg
>
> CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/).
> It would be great if Tika were able to provide CBOR parsing and identification. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump the crawled segments to files in the CBOR format. In order to read/parse those dumped files, it would be great if Tika could support parsing CBOR. The thing is that CommonCrawlDataDumper does not dump with the correct extension; it dumps with its own rule, and the default extension of a dumped file is .html, so it would be less painful if Tika were able to detect and parse those files without any pre-processing steps.
> CommonCrawlDataDumper calls the following to dump with CBOR:
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> FasterXML is a 3rd-party library for converting JSON to CBOR and vice versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like CBOR does not yet have magic numbers by which other applications can detect/identify it (PFA: rfc_cbor.jpg).
> It seems that the only ways to inform other applications of the type as of now are the extension (i.e. .cbor) or content-based detection (e.g. byte-histogram distribution estimation).
> There is another thing worth attention: it looks like Tika has attempted to add support for CBOR MIME detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg); this detection is not working with the CBOR files dumped by CommonCrawlDataDumper.
> According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a self-describing tag 55799 that seems usable for CBOR type identification (the hex code might be 0xd9d9f7), but it is probably up to the application to take care of this tag, and it is also possible that the FasterXML encoder used by the Nutch dumping tool omits this tag. An example CBOR file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has also been attached (PFA: 142440269.html).
> The following info is cited from the RFC: "...a decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text..."
> It looks like a file can have two parts/sections, i.e. a plain-text part and the JSON serialized by CBOR; this might also be worth attention and consideration in parsing and type identification.
> On the other hand, it is worth noting that the entries for CBOR extension detection need to be appended to tika-mimetypes.xml too, e.g.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
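For illustration only, a combined glob-plus-magic entry of the kind discussed in the ticket might look roughly like the following. The element names follow the tika-mimetypes.xml conventions, but this is a sketch, not the entry that was actually committed; the exact match syntax and priority should be checked against the file shipped with Tika:

```xml
<!-- Sketch of a possible tika-mimetypes.xml entry for CBOR: a glob on the
     .cbor extension plus magic for the self-describe tag 55799 (0xD9D9F7).
     Illustrative only; verify against Tika's real tika-mimetypes.xml. -->
<mime-type type="application/cbor">
  <magic priority="50">
    <match value="\xd9\xd9\xf7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```

The glob covers correctly named files, while the magic clause would also catch CBOR dumped under a misleading extension such as .html, provided the encoder emits the self-describe tag.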
Re: NUTCH-1994 and UCAR Dependencies
Thanks Lewis! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Lewis John Mcgibbney Reply-To: "dev@tika.apache.org" Date: Tuesday, April 21, 2015 at 7:14 PM To: "dev@tika.apache.org" Subject: Re: NUTCH-1994 and UCAR Dependencies >Hi Folks, >OK, so the final part of this jigsaw is as follows > >I've requested a staging area [0] on Sonatype OSSRH to release the MIT >licensed 3rd party bzip2 artifacts. >I had to Mavenize the project. I will submit this patch to the bzip2 >project and hopefully they will pull it in. If not then I will fork the >project and maintain it myself. > >[0] https://issues.sonatype.org/browse/OSSRH-15143 >[1] https://code.google.com/p/jbzip2/ > >On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney < >lewis.mcgibb...@gmail.com> wrote: > >> Hi Folks, >> Update >> >> On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney < >> lewis.mcgibb...@gmail.com> wrote: >> >>> >>> >>> [ivy:resolve] :: >>> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found >>> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found >>> [ivy:resolve] :: >>> >> >> >> Both of the above are now on Maven Central. >> I had to fix a couple of issues in the jj2000 library, namely >> https://github.com/Unidata/jj2000/pull/3 which was blocking us. >> >> I'm moving on to deal with the final one >> >> [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found >> >> I'll update in due course. >> Thanks >> Lewis >> > > > >-- >*Lewis*
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506359#comment-14506359 ]

Chris A. Mattmann commented on TIKA-1610:
-----------------------------------------

Applied pull request #42, thanks [~Lukeliush]!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by LukeLiush this closes #42" CHANGES.txt tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        CHANGES.txt
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Transmitting file data ..
Committed revision 1675250.
[chipotle:~/tmp/tika] mattmann%
{noformat}

Will look for improvements and the parser next, so will leave this open!
[GitHub] tika pull request: add entry for cbor glob extension in the tika-m...
Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/42
[jira] [Assigned] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-1610:
---------------------------------------

    Assignee: Chris A. Mattmann
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506214#comment-14506214 ]

Tim Allison commented on TIKA-1513:
-----------------------------------

In looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if we could add 0x00 at offsets 30 and 31? I'm currently grepping the Common Crawl slice from Julien Nioche for files starting with 0x03, and the vast majority are ".dbf", but there are some that end in .dct, .ndx (dbf index?), .tfm, .ctg... Will report findings tomorrow.

> Add mime detection and parsing for dbf files
> --------------------------------------------
>
>                 Key: TIKA-1513
>                 URL: https://issues.apache.org/jira/browse/TIKA-1513
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 1.9
>
> I just came across an Apache-licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?
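The tightened DBF magic Tim proposes can be sketched as a plain byte check. The offsets are assumptions taken from the dbf2002.com header layout he links (version byte 0x03 at offset 0, reserved zero bytes at offsets 30 and 31); the class is invented for this sketch and is not Tika code:

```java
/** Sketch of the DBF header check discussed above: version byte 0x03 at
 *  offset 0 plus zero bytes at offsets 30 and 31. The offsets are
 *  assumptions from the dbf2002.com layout, not a verified Tika rule. */
public class DbfMagic {
    public static boolean looksLikeDbf(byte[] header) {
        return header.length >= 32
                && header[0] == 0x03   // dBASE III+ without memo, per the linked layout
                && header[30] == 0x00  // reserved, expected zero
                && header[31] == 0x00; // reserved, expected zero
    }

    public static void main(String[] args) {
        byte[] h = new byte[32];
        h[0] = 0x03;
        System.out.println(looksLikeDbf(h)); // true
    }
}
```

Requiring the two extra zero bytes would reject many of the non-DBF files that merely start with 0x03, which is exactly the false-positive problem described in the comment.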
Re: Detection problem: Parsing scientific source codes for geoscientists
On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
> For the first step, I listed the file formats that are widely used in
> climate science:
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
>
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
> when I used Tika to obtain the content type of files with suffix .f, .f90,
> or .m, Tika detected these files as text/plain.

Your first step then is probably to try to work out how to identify these files, and add suitable mime magic for them, if possible. At the same time, make sure the common file extensions for them are listed against their mime entries, and make sure we have mime entries for all of these formats.

I'd probably recommend creating one JIRA per format with detection issues, then use that to track the work to add/expand the mime type, attach a small sample file, add detection unit tests, etc.

> Should I build a parser for each file format to get an exact content-type,
> as Java has SourceCodeParser?

As Lewis has said, once detection is working, you'll then want to add the missing parsers. You might find that the current SourceCodeParser could, with a little bit of work, handle some of these formats itself. Additional libraries and parsers may well be needed for the others. I'd suggest one JIRA per format you want a parser for that we lack, then use those to track the work.

Good luck!
Nick
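As a sketch of what "suitable mime magic" advice might translate into for one of these formats, a glob-based tika-mimetypes.xml entry for Fortran source could start like the following. This is illustrative only: the type name text/x-fortran and the element syntax should be verified against the tika-mimetypes.xml actually shipped with Tika, and reliable content magic for free-form source code is hard, so globs plus a sub-class-of relationship may be the practical starting point:

```xml
<!-- Illustrative sketch only: extension-based detection for Fortran
     source, falling back to text/plain as the parent type. Verify the
     type name and syntax against Tika's shipped tika-mimetypes.xml. -->
<mime-type type="text/x-fortran">
  <glob pattern="*.f"/>
  <glob pattern="*.f77"/>
  <glob pattern="*.f90"/>
  <sub-class-of type="text/plain"/>
</mime-type>
```

Analogous entries could be sketched for .gs, .ncl, and .pro, each with a small sample file and a detection unit test as suggested above.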
Re: Detection problem: Parsing scientific source codes for geoscientists
Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM, wrote:
>
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)

NICE list.

> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
> when I used Tika to obtain the content type of the files (with suffix .f,
> .f90, .m), Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
>
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
>
> [SNIP]
>
> Should I build a parser for each file format to get an exact content-type,
> as Java has SourceCodeParser?

As far as I know we have no parser for Fortran documents. You could try using the following Java project:
http://sourceforge.net/projects/fortran-parser/
It is dual-licensed under the Eclipse and BSD licenses.

Hope this helps.
Lewis
Re: NUTCH-1994 and UCAR Dependencies
Patch for Mavenizing the bzip2 project https://code.google.com/p/jbzip2/issues/detail?id=3 Lewis On Tue, Apr 21, 2015 at 4:14 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, > OK, so the final part of this jigsaw is as follows > > I've requested a staging area [0] on Sonatype OSSRH to release the MIT > licensed 3rd party bzip2 artifacts. > I had to Mavenize the project. I will submit this patch to the bzip2 > project and hopefully they will pull it in. If not then I will fork the > project and maintain it myself. > > [0] https://issues.sonatype.org/browse/OSSRH-15143 > [1] https://code.google.com/p/jbzip2/ > > On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> wrote: > >> Hi Folks, >> Update >> >> On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney < >> lewis.mcgibb...@gmail.com> wrote: >> >>> >>> >>> [ivy:resolve] :: >>> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found >>> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found >>> [ivy:resolve] :: >>> >> >> >> Both of the above are now on Maven Central. >> I had to fix a couple of issues in the jj2000 library, namely >> https://github.com/Unidata/jj2000/pull/3 which was blocking us. >> >> I'm moving on to deal with the final one >> >> [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found >> >> I'll update in due course. >> Thanks >> Lewis >> > > > > -- > *Lewis* > -- *Lewis*
Re: NUTCH-1994 and UCAR Dependencies
Hi Folks, OK, so the final part of this jigsaw is as follows I've requested a staging area [0] on Sonatype OSSRH to release the MIT licensed 3rd party bzip2 artifacts. I had to Mavenize the project. I will submit this patch to the bzip2 project and hopefully they will pull it in. If not then I will fork the project and maintain it myself. [0] https://issues.sonatype.org/browse/OSSRH-15143 [1] https://code.google.com/p/jbzip2/ On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, > Update > > On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> wrote: > >> >> >> [ivy:resolve] :: >> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found >> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found >> [ivy:resolve] :: >> > > > Both of the above are now on Maven Central. > I had to fix a couple of issues in the jj2000 library, namely > https://github.com/Unidata/jj2000/pull/3 which was blocking us. > > I'm moving on to deal with the final one > > [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found > > I'll update in due course. > Thanks > Lewis > -- *Lewis*
Re: NUTCH-1994 and UCAR Dependencies
Hi Folks, Update On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > > > [ivy:resolve] :: > [ivy:resolve] :: edu.ucar#jj2000;5.2: not found > [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found > [ivy:resolve] :: > Both of the above are now on Maven Central. I had to fix a couple of issues in the jj2000 library, namely https://github.com/Unidata/jj2000/pull/3 which was blocking us. I'm moving on to deal with the final one [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found I'll update in due course. Thanks Lewis
[GitHub] tika pull request: add entry for cbor glob extension in the tika-m...
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/42 add entry for cbor glob extension in the tika-mimetypes.xml You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika cborExtension Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/42.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #42 commit 5b86cccdfc6d637cb44c9f8b2642e438c2ae5ff4 Author: LukeLiush Date: 2015-04-21T21:39:07Z add entry for cbor glob extension in the tika-mimetypes.xml --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
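For reference, a glob-only entry of the kind this pull request describes would presumably look something like the following in tika-mimetypes.xml (a sketch; the actual diff in PR #42 is not reproduced here):

```xml
<mime-type type="application/cbor">
  <glob pattern="*.cbor"/>
  <!-- A glob alone only covers extension-based detection. Magic detection
       could later key on the CBOR self-describe tag 55799 (bytes d9 d9 f7)
       when a producer chooses to emit it. -->
</mime-type>
```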
Re: NUTCH-1994 and UCAR Dependencies
Hi Lewis, I also tried upgrading Tika in Nutch, but I ran into the same issue (though udunits is now found, as expected): [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: edu.ucar#jj2000;5.2: not found [ivy:retrieve] :: org.itadaki#bzip2;0.9.1: not found [ivy:retrieve] :: Thanks for pushing the dependencies out. Tyler On Tue, Apr 21, 2015 at 1:50 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, > Whilst addressing NUTCH-1994, I've experienced a dependency problem > (related to unpublished artifacts on Maven Central) which I am working > through right now. > When making the upgrade in Nutch, I get the following > > [ivy:resolve] -- artifact edu.ucar#udunits;4.5.5!udunits.jar: > [ivy:resolve] > > http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar > [ivy:resolve] :: > [ivy:resolve] :: UNRESOLVED DEPENDENCIES :: > [ivy:resolve] :: > [ivy:resolve] :: edu.ucar#jj2000;5.2: not found > [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found > [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found > [ivy:resolve] :: > [ivy:resolve] > [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS > > BUILD FAILED > /usr/local/trunk_clean/build.xml:112: The following error occurred while > executing this line: > /usr/local/trunk_clean/src/plugin/build.xml:60: The following error > occurred while executing this line: > /usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to > resolve dependencies: > resolve failed - see output for details > > Total time: 17 seconds > > I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts so they > will be available imminently. The remaining artifact at edu.ucar#jj2000;5.2 > has a corrupted POM which means that OSS Nexus will not accept it. I'll > send a pull request further upstream for that ASAP. > > Finally, the BZIP dependency is a 3rd party dependency from another org, > licensed under the MIT license. 
So I will register interest to publish this > dependency, push it, then we will be good to go. > > Lewis > > > > -- > *Lewis* >
[jira] [Commented] (TIKA-1601) Integrate Jackcess to handle MSAccess files
[ https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505633#comment-14505633 ] Tim Allison commented on TIKA-1601: --- I don't. That's half the fun of a patch, right? :) On the sqlite parser, I tried to have at least one column for each data type, non-ASCII text to confirm no encoding problems, and an embedded doc. Happy to generate this if it would help. Thank you, again. > Integrate Jackcess to handle MSAccess files > --- > > Key: TIKA-1601 > URL: https://issues.apache.org/jira/browse/TIKA-1601 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison > > Recently, James Ahlborn, the current maintainer of > [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense > Jackcess to Apache 2.0. [~boneill], the CTO at [Health Market Science, a > LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with > this relicensing and led the charge to obtain all necessary corporate > approval to deliver a > [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to > Apache. As anyone who has tried to get corporate approval for anything > knows, this can sometimes require not a small bit of effort. > If I may speak on behalf of Tika and the larger Apache community, I offer a > sincere thanks to James, Brian and the other developers and contributors to > Jackcess!!! > Once the licensing info has been changed in Jackcess and the new release is > available in maven, we can integrate Jackcess into Tika and add a capability > to process MSAccess. > As a side note, I reached out to the developers and contributors to determine > if there were any objections. I couldn't find addresses for everyone, and > not everyone replied, but those who did offered their support to this move. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1601) Integrate Jackcess to handle MSAccess files
[ https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505377#comment-14505377 ] Luis Filipe Nassif commented on TIKA-1601: -- Great! Give me 3 more days to submit the patch. Do you have an Apache 2-licensed MDB file for unit tests? > Integrate Jackcess to handle MSAccess files > --- > > Key: TIKA-1601 > URL: https://issues.apache.org/jira/browse/TIKA-1601 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison > > Recently, James Ahlborn, the current maintainer of > [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense > Jackcess to Apache 2.0. [~boneill], the CTO at [Health Market Science, a > LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with > this relicensing and led the charge to obtain all necessary corporate > approval to deliver a > [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to > Apache. As anyone who has tried to get corporate approval for anything > knows, this can sometimes require not a small bit of effort. > If I may speak on behalf of Tika and the larger Apache community, I offer a > sincere thanks to James, Brian and the other developers and contributors to > Jackcess!!! > Once the licensing info has been changed in Jackcess and the new release is > available in maven, we can integrate Jackcess into Tika and add a capability > to process MSAccess. > As a side note, I reached out to the developers and contributors to determine > if there were any objections. I couldn't find addresses for everyone, and > not everyone replied, but those who did offered their support to this move. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Detection problem: Parsing scientific source codes for geoscientists
Hi Tika friends, I am currently engaged in a project funded by the National Science Foundation. Our goal is to develop a research-friendly environment where geoscientists, like me, can easily find the source code they need. According to a survey, scientists spend a considerable amount of their time processing data instead of doing actual science. Based on my experience as a climate scientist, there is a set of analysis tools that are most frequently used in atmospheric science. Therefore, it could be helpful if these tools can be easily shared among scientists. The thing is that the tools are written in various scientific languages, so we are trying to provide the metadata of source code stored in public repositories to help scientists select source code for their own usage. For the first step, I listed the file formats that are widely used in climate science. FORTRAN (.f, .f90, .f77) Python (.py) R (.R) Matlab (.m) GrADS (Grid Analysis and Display System) (.gs) NCL (NCAR Command Language) (.ncl) IDL (Interactive Data Language) (.pro) I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of the files (with suffixes .f, .f90, .m), Tika detected these files as text/plain: ohjihyun% tika -m spctime.f Content-Encoding: ISO-8859-1 Content-Length: 16613 Content-Type: text/plain; charset=ISO-8859-1 X-Parsed-By: org.apache.tika.parser.DefaultParser X-Parsed-By: org.apache.tika.parser.txt.TXTParser resourceName: spctime.f ohjihyun% tika -m wavelet.m Content-Encoding: ISO-8859-1 Content-Length: 5868 Content-Type: text/plain; charset=ISO-8859-1 X-Parsed-By: org.apache.tika.parser.DefaultParser X-Parsed-By: org.apache.tika.parser.txt.TXTParser resourceName: wavelet.m I checked that Tika gives the correct content type (text/x-java-source) for a Java file: ohjihyun% tika -m UrlParser.java Content-Encoding: ISO-8859-1 Content-Length: 2178 Content-Type: text/x-java-source LoC: 70 X-Parsed-By: org.apache.tika.parser.DefaultParser 
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser resourceName: UrlParser.java Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser? Thank you in advance for your insightful comments. Ji-Hyun
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505367#comment-14505367 ] Luis Filipe Nassif commented on TIKA-879: - Yes, thank you very much for testing with govdocs1 ([~gagravarr]'s suggestion)! > Detection problem: message/rfc822 file is detected as text/plain. > - > > Key: TIKA-879 > URL: https://issues.apache.org/jira/browse/TIKA-879 > Project: Tika > Issue Type: Bug > Components: metadata, mime >Affects Versions: 1.0, 1.1, 1.2 > Environment: linux 3.2.9 > oracle jdk7, openjdk7, sun jdk6 >Reporter: Konstantin Gribov > Labels: new-parser > Attachments: TIKA-879-thunderbird.eml > > > When using {{DefaultDetector}} mime type for {{.eml}} files is different (you > can test it on {{testRFC822}} and {{testRFC822_base64}} in > {{tika-parsers/src/test/resources/test-documents/}}). > Main reason for such behavior is that only magic detector is really works for > such files. Even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}} > file name in {{RESOURCE_NAME_KEY}}. > As I found {{MediaTypeRegistry.isSpecializationOf("message/rfc822", > "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} > works only by magic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
NUTCH-1994 and UCAR Dependencies
Hi Folks, Whilst addressing NUTCH-1994, I've experienced a dependency problem (related to unpublished artifacts on Maven Central) which I am working through right now. When making the upgrade in Nutch, I get the following [ivy:resolve] -- artifact edu.ucar#udunits;4.5.5!udunits.jar: [ivy:resolve] http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar [ivy:resolve] :: [ivy:resolve] :: UNRESOLVED DEPENDENCIES :: [ivy:resolve] :: [ivy:resolve] :: edu.ucar#jj2000;5.2: not found [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found [ivy:resolve] :: [ivy:resolve] [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /usr/local/trunk_clean/build.xml:112: The following error occurred while executing this line: /usr/local/trunk_clean/src/plugin/build.xml:60: The following error occurred while executing this line: /usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to resolve dependencies: resolve failed - see output for details Total time: 17 seconds I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts so they will be available imminently. The remaining artifact at edu.ucar#jj2000;5.2 has a corrupted POM which means that OSS Nexus will not accept it. I'll send a pull request further upstream for that ASAP. Finally, the BZIP dependency is a 3rd party dependency from another org, licensed under the MIT license. So I will register interest to publish this dependency, push it, then we will be good to go. Lewis -- *Lewis*
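For context, the three coordinates the resolver is failing on correspond to Ivy dependency declarations roughly like these (standard Ivy syntax; the conf mapping is a guess, not copied from Nutch's actual ivy.xml):

```xml
<dependency org="edu.ucar"    name="jj2000"  rev="5.2"   conf="*->default"/>
<dependency org="edu.ucar"    name="udunits" rev="4.5.5" conf="*->default"/>
<dependency org="org.itadaki" name="bzip2"   rev="0.9.1" conf="*->default"/>
```

Until all three artifacts exist on Maven Central (or another configured resolver), ivy:resolve will keep reporting them as "not found" regardless of the build configuration.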
[jira] [Commented] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents
[ https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505358#comment-14505358 ] Hudson commented on TIKA-1611: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #639 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/639/]) TIKA-1611 -- allow RecursiveParserWrapper to catch exceptions caused by embedded documents (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675159) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java * /tika/trunk/tika-batch/src/test/java/org/apache/tika/util * /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java * /tika/trunk/tika-core/src/main/java/org/apache/tika/utils/ExceptionUtils.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java * /tika/trunk/tika-parsers/src/test/resources/test-documents/test_recursive_embedded_npe.docx * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java > Allow RecursiveParserWrapper to catch exceptions from embedded documents > > > Key: TIKA-1611 > URL: https://issues.apache.org/jira/browse/TIKA-1611 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.9 > > > While parsing embedded documents, currently, if a parser hits an > EncryptedDocumentException or anything wrapped in a TikaException, the > Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: > {noformat} > DELEGATING_PARSER.parse( > newStream, > new EmbeddedContentHandler(new > BodyContentHandler(handler)), > metadata, context); > } catch (EncryptedDocumentException ede) { > // TODO: can we log a warning that we lack the password? 
> // For now, just skip the content > } catch (TikaException e) { > // TODO: can we log a warning somehow? > // Could not parse the entry, just skip the content > } finally { > tmp.close(); > } > {noformat} > For some applications, it might be better to store the stack trace of the > attachment that caused an exception. > The proposal would be to include the stack trace in the metadata object for > that particular attachment. > The user will be able to specify whether or not to store stack traces, and > the default will be to store stack traces. This will be a small change to > the legacy behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
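The proposal above — keep parsing but record the failure on the attachment's metadata — hinges on rendering a stack trace to a String. A minimal, self-contained sketch of that step (the class and method names here are illustrative, not Tika's actual ExceptionUtils API):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraceCapture {

    // Render the full stack trace of a Throwable into a String, suitable for
    // storing under a metadata key on the embedded document that failed.
    public static String stackTraceToString(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    public static void main(String[] args) {
        String trace = stackTraceToString(new RuntimeException("embedded doc failed"));
        // The first line holds the exception class and message
        System.out.println(trace.split(System.lineSeparator())[0]);
    }
}
```

Storing the trace per attachment, rather than rethrowing, is what keeps one broken embedded document from aborting extraction of its container.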
[jira] [Created] (TIKA-1612) Exceptions getting image data in PPT files
Tim Allison created TIKA-1612: - Summary: Exceptions getting image data in PPT files Key: TIKA-1612 URL: https://issues.apache.org/jira/browse/TIKA-1612 Project: Tika Issue Type: Bug Reporter: Tim Allison Priority: Minor In numerous (~500) ppt files in govdocs1, we're getting zip exceptions (unknown compression method, bad block, etc) when Tika's HSLFExtractor calls {{getData()}} on an embedded image. Under normal circumstances (I just learned today...), if an attachment causes a RuntimeException, we are currently swallowing that in {{ParsingEmbeddedDocumentExtractor}}. However, because we're calling {{getData()}} before the embedded extractor takes over, if there is an exception there, the parse of the entire file fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1612) Exceptions getting image data in PPT files
[ https://issues.apache.org/jira/browse/TIKA-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505335#comment-14505335 ] Tim Allison commented on TIKA-1612: --- Not sure how we want to fix this. To make this parallel to our handling of other embedded files, we'd just swallow the exception...I really don't like that option. Recommendations? > Exceptions getting image data in PPT files > -- > > Key: TIKA-1612 > URL: https://issues.apache.org/jira/browse/TIKA-1612 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Minor > > In numerous (~500) ppt files in govdocs1, we're getting zip exceptions > (unknown compression method, bad block, etc) when Tika's HSLFExtractor calls > {{getData()}} on an embedded image. > Under normal circumstances (I just learned today...), if an attachment causes > a RuntimeException, we are currently swallowing that in > {{ParsingEmbeddedDocumentExtractor}}. > However, because we're calling {{getData()}} before the embedded extractor > takes over, if there is an exception there, the parse of the entire file > fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
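One direction would be to guard each getData()-style call individually, so a corrupt image skips that one picture instead of failing the whole parse. This is a hypothetical sketch, not what HSLFExtractor actually does; Callable stands in for POI's real picture type:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class PerImageGuard {

    // Extract whatever image data we can; a RuntimeException from one entry
    // (e.g. "unknown compression method") is swallowed and parsing continues.
    public static List<byte[]> extractAll(List<Callable<byte[]>> images) {
        List<byte[]> out = new ArrayList<>();
        for (Callable<byte[]> image : images) {
            try {
                out.add(image.call());
            } catch (Exception e) {
                // Per-image failure: skip it (or record it, per the TIKA-1611 approach)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Callable<byte[]>> images = new ArrayList<>();
        images.add(() -> new byte[]{1, 2});
        images.add(() -> { throw new RuntimeException("unknown compression method"); });
        images.add(() -> new byte[]{3});
        System.out.println(extractAll(images).size()); // prints 2
    }
}
```

The open design question from the thread remains whether to silently swallow, as with other embedded files, or to surface the per-image failure somewhere visible.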
[jira] [Resolved] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents
[ https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1611. --- Resolution: Fixed r1675159. Nothing like testing to see behavior, rather than assumptions. :( > Allow RecursiveParserWrapper to catch exceptions from embedded documents > > > Key: TIKA-1611 > URL: https://issues.apache.org/jira/browse/TIKA-1611 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.9 > > > While parsing embedded documents, currently, if a parser hits an > EncryptedDocumentException or anything wrapped in a TikaException, the > Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: > {noformat} > DELEGATING_PARSER.parse( > newStream, > new EmbeddedContentHandler(new > BodyContentHandler(handler)), > metadata, context); > } catch (EncryptedDocumentException ede) { > // TODO: can we log a warning that we lack the password? > // For now, just skip the content > } catch (TikaException e) { > // TODO: can we log a warning somehow? > // Could not parse the entry, just skip the content > } finally { > tmp.close(); > } > {noformat} > For some applications, it might be better to store the stack trace of the > attachment that caused an exception. > The proposal would be to include the stack trace in the metadata object for > that particular attachment. > The user will be able to specify whether or not to store stack traces, and > the default will be to store stack traces. This will be a small change to > the legacy behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents
[ https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1611: -- Description: While parsing embedded documents, currently, if a parser hits an EncryptedDocumentException or anything wrapped in a TikaException, the Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: {noformat} DELEGATING_PARSER.parse( newStream, new EmbeddedContentHandler(new BodyContentHandler(handler)), metadata, context); } catch (EncryptedDocumentException ede) { // TODO: can we log a warning that we lack the password? // For now, just skip the content } catch (TikaException e) { // TODO: can we log a warning somehow? // Could not parse the entry, just skip the content } finally { tmp.close(); } {noformat} For some applications, it might be better to store the stack trace of the attachment that caused an exception. The proposal would be to include the stack trace in the metadata object for that particular attachment. The user will be able to specify whether or not to store stack traces, and the default will be to store stack traces. This will be a small change to the legacy behavior. was: While parsing embedded documents, currently, if a parser hits an Exception, the Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: {noformat} DELEGATING_PARSER.parse( newStream, new EmbeddedContentHandler(new BodyContentHandler(handler)), metadata, context); } catch (EncryptedDocumentException ede) { // TODO: can we log a warning that we lack the password? // For now, just skip the content } catch (TikaException e) { // TODO: can we log a warning somehow? // Could not parse the entry, just skip the content } finally { tmp.close(); } {noformat} For some applications, it might be better to store the stack trace of the attachment that caused an exception. The proposal would be to include the stack trace in the metadata object for that particular attachment. 
The user will be able to specify whether or not to store stack traces, and the default will be to store stack traces. This will be a small change to the legacy behavior. > Allow RecursiveParserWrapper to catch exceptions from embedded documents > > > Key: TIKA-1611 > URL: https://issues.apache.org/jira/browse/TIKA-1611 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.9 > > > While parsing embedded documents, currently, if a parser hits an > EncryptedDocumentException or anything wrapped in a TikaException, the > Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: > {noformat} > DELEGATING_PARSER.parse( > newStream, > new EmbeddedContentHandler(new > BodyContentHandler(handler)), > metadata, context); > } catch (EncryptedDocumentException ede) { > // TODO: can we log a warning that we lack the password? > // For now, just skip the content > } catch (TikaException e) { > // TODO: can we log a warning somehow? > // Could not parse the entry, just skip the content > } finally { > tmp.close(); > } > {noformat} > For some applications, it might be better to store the stack trace of the > attachment that caused an exception. > The proposal would be to include the stack trace in the metadata object for > that particular attachment. > The user will be able to specify whether or not to store stack traces, and > the default will be to store stack traces. This will be a small change to > the legacy behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents
[ https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1611: -- Description: While parsing embedded documents, currently, if a parser hits an Exception, the Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: {noformat} DELEGATING_PARSER.parse( newStream, new EmbeddedContentHandler(new BodyContentHandler(handler)), metadata, context); } catch (EncryptedDocumentException ede) { // TODO: can we log a warning that we lack the password? // For now, just skip the content } catch (TikaException e) { // TODO: can we log a warning somehow? // Could not parse the entry, just skip the content } finally { tmp.close(); } {noformat} For some applications, it might be better to store the stack trace of the attachment that caused an exception. The proposal would be to include the stack trace in the metadata object for that particular attachment. The user will be able to specify whether or not to store stack traces, and the default will be to store stack traces. This will be a small change to the legacy behavior. was: While parsing embedded documents, currently, if a parser hits an Exception, the parsing of the entire document comes to a grinding halt. For some applications, it might be better to catch the exception at the attachment level. The proposal would be to include the stack trace in the metadata object for that particular attachment. The user will be able to specify whether or not to catch embedded exceptions, and the default will be to catch embedded exceptions. This will be a small change to the legacy behavior. 
> Allow RecursiveParserWrapper to catch exceptions from embedded documents > > > Key: TIKA-1611 > URL: https://issues.apache.org/jira/browse/TIKA-1611 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.9 > > > While parsing embedded documents, currently, if a parser hits an Exception, > the Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}: > {noformat} > DELEGATING_PARSER.parse( > newStream, > new EmbeddedContentHandler(new > BodyContentHandler(handler)), > metadata, context); > } catch (EncryptedDocumentException ede) { > // TODO: can we log a warning that we lack the password? > // For now, just skip the content > } catch (TikaException e) { > // TODO: can we log a warning somehow? > // Could not parse the entry, just skip the content > } finally { > tmp.close(); > } > {noformat} > For some applications, it might be better to store the stack trace of the > attachment that caused an exception. > The proposal would be to include the stack trace in the metadata object for > that particular attachment. > The user will be able to specify whether or not to store stack traces, and > the default will be to store stack traces. This will be a small change to > the legacy behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505269#comment-14505269 ] Tim Allison edited comment on TIKA-879 at 4/21/15 5:04 PM: --- Y, will do. Results probably tomorrow. This? was (Author: talli...@mitre.org): Y, will do. Results probably tomorrow. > Detection problem: message/rfc822 file is detected as text/plain. > - > > Key: TIKA-879 > URL: https://issues.apache.org/jira/browse/TIKA-879 > Project: Tika > Issue Type: Bug > Components: metadata, mime >Affects Versions: 1.0, 1.1, 1.2 > Environment: linux 3.2.9 > oracle jdk7, openjdk7, sun jdk6 >Reporter: Konstantin Gribov > Labels: new-parser > Attachments: TIKA-879-thunderbird.eml > > > When using {{DefaultDetector}} mime type for {{.eml}} files is different (you > can test it on {{testRFC822}} and {{testRFC822_base64}} in > {{tika-parsers/src/test/resources/test-documents/}}). > Main reason for such behavior is that only magic detector is really works for > such files. Even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}} > file name in {{RESOURCE_NAME_KEY}}. > As I found {{MediaTypeRegistry.isSpecializationOf("message/rfc822", > "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} > works only by magic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505269#comment-14505269 ] Tim Allison commented on TIKA-879: -- Y, will do. Results probably tomorrow. > Detection problem: message/rfc822 file is detected as text/plain. > - > > Key: TIKA-879 > URL: https://issues.apache.org/jira/browse/TIKA-879 > Project: Tika > Issue Type: Bug > Components: metadata, mime >Affects Versions: 1.0, 1.1, 1.2 > Environment: linux 3.2.9 > oracle jdk7, openjdk7, sun jdk6 >Reporter: Konstantin Gribov > Labels: new-parser > Attachments: TIKA-879-thunderbird.eml > > > When using {{DefaultDetector}} mime type for {{.eml}} files is different (you > can test it on {{testRFC822}} and {{testRFC822_base64}} in > {{tika-parsers/src/test/resources/test-documents/}}). > Main reason for such behavior is that only magic detector is really works for > such files. Even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}} > file name in {{RESOURCE_NAME_KEY}}. > As I found {{MediaTypeRegistry.isSpecializationOf("message/rfc822", > "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} > works only by magic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505178#comment-14505178 ] Jeremy B. Merrill commented on TIKA-1608: - It's the only one I've found so far out of 300,000ish documents (most of which are plain emails, few of which are .docs). > RuntimeException on extracting text from Word 97-2004 Document > -- > > Key: TIKA-1608 > URL: https://issues.apache.org/jira/browse/TIKA-1608 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 >Reporter: Jeremy B. Merrill > Attachments: 1534-attachment.doc > > > Extracting text from the Word 97-2004 document attached here fails with the > following stacktrace: > $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text > 1534-attachment.doc > Exception in thread "main" org.apache.tika.exception.TikaException: > Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@69af0db6 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at > org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171) > at > org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101) > at > org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49) > at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109) > at > org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84) > at >
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > ... 5 more > I'm using trunk from Github, which I think is a flavor of 1.9. The document > opens properly in Word for Mac '11. > Happy to answer questions; I'm also on the "user" mailing list. If it's > relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put > that document here in Jira rather than on my own dropbox.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1554) Improve EMF file detection
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505172#comment-14505172 ] Luis Filipe Nassif commented on TIKA-1554: -- Actually r1667661 > Improve EMF file detection > -- > > Key: TIKA-1554 > URL: https://issues.apache.org/jira/browse/TIKA-1554 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.7 >Reporter: Luis Filipe Nassif >Assignee: Chris A. Mattmann > Fix For: 1.8 > > Attachments: nonEmf.dat > > > I am getting many files being incorrectly detected as application/x-emf. I > think the current magic is too common. According to MS documentation > (https://msdn.microsoft.com/en-us/library/cc230635.aspx and > https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved > to: > {code} > > EMF > <_comment>Extended Metafile > > > > > > > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
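The {code} block in the message above lost its XML markup in archiving; only the "EMF" acronym and "Extended Metafile" comment text survive. As a rough reconstruction of what a tightened entry might look like (based on our reading of the MS-EMF documents cited there, where the EMR_HEADER record type 0x00000001 sits at offset 0 and the " EMF" signature, 0x464D4520, at offset 40, and not on the original comment's exact values):

```xml
<!-- Hypothetical tightened EMF magic, element names per tika-mimetypes.xml
     conventions: require both the header record type and the signature. -->
<mime-type type="application/x-emf">
  <acronym>EMF</acronym>
  <_comment>Extended Metafile</_comment>
  <magic priority="50">
    <match value="0x01000000" type="string" offset="0">
      <match value=" EMF" type="string" offset="40"/>
    </match>
  </magic>
  <glob pattern="*.emf"/>
</mime-type>
```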
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505132#comment-14505132 ] Luis Filipe Nassif commented on TIKA-879: - Maybe we could keep the original magics and ADD the widened versions with a "\n" prefix to decrease the number of false positives (I have got a small number of them)? Could you try the widened magics with govdocs1 [~talli...@mitre.org]?
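Luis's "\n"-prefix idea could be sketched roughly like this in tika-mimetypes.xml terms. This is an illustration, not the actual patch; the header names, ranges, and escape syntax ("\n" vs. "\x0a") are assumptions:

```xml
<!-- Hypothetical widened-and-anchored magic for message/rfc822: scan the
     first 8k for header names, but require a preceding newline so the
     match only fires at a line start, cutting false positives. -->
<magic priority="50">
  <match value="\nFrom:" type="string" offset="0:8192"/>
  <match value="\nReceived:" type="string" offset="0:8192"/>
</magic>
```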
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505113#comment-14505113 ] Tim Allison commented on TIKA-1608: --- In govdocs1, there are 24 of these: {noformat} java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.poi.hwpf.sprm.SprmBuffer.append(SprmBuffer.java:128) at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:293) at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:116) at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:136) at o.a.t.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532) {noformat} There are 2 of those in our commoncrawl slice. Nothing that matches your trace, though. Thank you for attaching it. How common is this stack trace in your set?
[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy B. Merrill updated TIKA-1608: Description: Extracting text from the Word 97-2004 document attached here fails with the following stacktrace: $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134) Caused by: java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171) at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101) at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49) at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109) at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) ... 5 more I'm using trunk from Github, which I think is a flavor of 1.9. The document opens properly in Word for Mac '11. Happy to answer questions; I'm also on the "user" mailing list. If it's relevant, I'm on java 1.7.0_55...
(Also let me know if there's a way to put that document here in Jira rather than on my own dropbox.) was: Extracting text from the Word 97-2004 document located here (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with the following stacktrace: $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134) Caused by: java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171) at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101) at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49) at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109) at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) ... 5 more I'm using trunk from Github, which I think is a flavor of 1.9. The document opens properly in Word for Mac '11. Happy to answer questions; I'm also on the "user" mailing list. If it's relevant, I'm on java 1.7.0_55...
(Also let me know if there's a way to put that document here in Jira rather than on my own dropbox.)
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505102#comment-14505102 ] Jeremy B. Merrill commented on TIKA-1608: - POI bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=57843
[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated TIKA-1607: --- Summary: Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata (was: Introduce new HashMap data structure for persitsence of Tika Metadata) > Introduce new arbitrary object key/values data structure for persitsence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.9 > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
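Lewis's proposal above can be sketched in plain Java. This is a hypothetical illustration of the "String: Object" idea, not Tika's Metadata API; the class and method names here are made up, and backwards compatibility is modeled by keeping the string getter/setter untouched:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an arbitrary-object metadata store: plain String
// properties keep working as before, while richer values (e.g. a list of
// phone numbers, each with its own attribute map) ride in as Objects.
public class ObjectMetadata {
    private final Map<String, Object> data = new HashMap<>();

    // Existing-style API: simple string properties stay backwards compatible.
    public void set(String name, String value) { data.put(name, value); }

    public String get(String name) {
        Object v = data.get(name);
        return (v instanceof String) ? (String) v : null;
    }

    // Proposed extension: attach an arbitrary structured value.
    public void setObject(String name, Object value) { data.put(name, value); }

    public <T> T getObject(String name, Class<T> type) {
        Object v = data.get(name);
        return type.isInstance(v) ? type.cast(v) : null;
    }

    public static void main(String[] args) {
        ObjectMetadata meta = new ObjectMetadata();
        meta.set("Content-Type", "text/plain");

        // One phone number carrying its libphonenumber-style attributes.
        Map<String, String> number = new HashMap<>();
        number.put("LibPN-CountryCode", "US");
        number.put("LibPN-NumberType", "International");
        List<Map<String, String>> numbers = new ArrayList<>();
        numbers.add(number);
        meta.setObject("phonenumbers", numbers);

        System.out.println(meta.get("Content-Type"));
        System.out.println(meta.getObject("phonenumbers", List.class).size());
    }
}
```

Serialization (Ray's JSON/XML point below) would then be a matter of walking `data` and dispatching on value type.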
[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy B. Merrill updated TIKA-1608: Attachment: 1534-attachment.doc document failing under this bug
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505093#comment-14505093 ] Jeremy B. Merrill commented on TIKA-1608: - Hi Tim, I added the document. I'm totally cool with the document being viewed by the public. I can't really grant it to the ASF since I didn't create it. It's an attachment from an email in an email dump (http://jebemail.com) posted by former Florida governor Jeb Bush. So whether it's usable is probably a question for the ASF's lawyers. But for the avoidance of doubt, I grant any rights that I might have in the document to the ASF. I'll open a POI bug.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505092#comment-14505092 ] Tim Allison commented on TIKA-1513: --- Completely agree. Only 2,386 files. This is the table of the file extensions for files identified as application/octet-stream.
||File Extension||Count||
|dbase3|1664|
|wp|362|
|unk|285|
|gls|60|
|ileaf|4|
|sys|3|
|chp|2|
|lnk|2|
|mac|2|
|squeak|1|
|bin|1|
Would very much appreciate what you find, and yes, we can certainly decrease the priority...I had my priorities backwards. Sorry. Obviously, if you find false positives, we'll back off to file suffix. I, too, was less than enthusiastic about a single byte mime id'er. Thank you! > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.9 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
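The single-byte 0x03 magic with a lowered priority that Tim and Luis discuss could look roughly like this. A sketch only; the media type name, priority value, and element layout are assumptions, not the committed entry:

```xml
<!-- Hypothetical low-priority magic for dBASE III files: the lone version
     byte 0x03 at offset 0 is weak evidence, so priority is dropped to 20
     to let stronger magics and the *.dbf glob win disputes. -->
<mime-type type="application/x-dbf">
  <magic priority="20">
    <match value="0x03" type="string" offset="0"/>
  </magic>
  <glob pattern="*.dbf"/>
</mime-type>
```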
[jira] [Closed] (TIKA-1554) Improve EMF file detection
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Filipe Nassif closed TIKA-1554. Resolution: Fixed Fix Version/s: 1.8 Resolved in r4608ff5. Thanks.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505057#comment-14505057 ] Luis Filipe Nassif commented on TIKA-1513: -- No, I did not give a try to 0x03. How many files are detected as octet-stream in govdocs1? I wouldn't like to hit an issue similar to TIKA-1554 again (I am indexing ALL desktop files). I will test 0x03 and report the results here. Can we at least decrease the magic priority to 10 or 20 for now?
[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persitsence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505054#comment-14505054 ] Ray Gauss II commented on TIKA-1607: We've had a few discussions on structured metadata over the years, some of which was captured in the [MetadataRoadmap Wiki page|http://wiki.apache.org/tika/MetadataRoadmap]. I'd agree that we should strive to maintain backwards compatibility for simple values. I think we should also consider serialization of the metadata store, not just in the {{Serializable}} interface sense, but perhaps being able to easily marshal the entire metadata store into JSON and XML. As [~gagravarr] points out, work has been done to express structured metadata via the existing metadata store. In that email thread you'll find reference to the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].
[jira] [Commented] (TIKA-1501) Fix the disabled Tika Bundle OSGi related unit tests
[ https://issues.apache.org/jira/browse/TIKA-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505051#comment-14505051 ] Hudson commented on TIKA-1501: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #638 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/638/]) TIKA-1501: Fix disabled OSGi related unit tests. Fixes from Bob Paulin. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675121) * /tika/trunk/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java > Fix the disabled Tika Bundle OSGi related unit tests > > > Key: TIKA-1501 > URL: https://issues.apache.org/jira/browse/TIKA-1501 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.6, 1.7 >Reporter: Nick Burch > Fix For: 1.9 > > Attachments: TIKA-1501-trunk.patch, TIKA-1501-trunkv2.patch, > TIKA-1501.patch > > > Currently, the unit tests for the Tika Bundle contain several bits like: > {code} > @Ignore // TODO Fix this test > {code} > We should really fix these unit tests so they work, and re-enable them -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042 ] Moritz Dorka commented on TIKA-1315: I believe I could speed up the process by ultimately writing a unit test for the POI-part... I'm just having a hard time motivating myself to write unit tests for a few stupid getters. What you could also do is to hardcode {code}getLevelNumberingPlaceholderOffsets(){code} to always return {code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most of all (trivial) cases (however, I have not tested the reaction of my code to such cheating). There is also a very subtle bug left in my code which only triggers in ListLevelOverrides and _sometimes_ provokes wrong number increments. If I find the time I will update my patch. > Basic list support in WordExtractor > --- > > Key: TIKA-1315 > URL: https://issues.apache.org/jira/browse/TIKA-1315 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Filip Bednárik >Priority: Minor > Fix For: 1.9 > > Attachments: ListManager.tar.bz2, ListNumbering.patch, > ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch > > > Hello guys, I am really sorry to post issue like this because I have no other > way of contacting you and I don't quite understand how you manage forks and > pull requests (I don't think you do that). Plus I don't know your coding > styles and stuff. > In my project I needed for tika to parse numbered lists from word .doc > documents, but TIKA doesn't support it. So I looked for solution and found > one here: > http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ > . So I adapted this solution to Apache TIKA with few fixes and improvements. > Anyway feel free to use any of it so it can help people who struggle with > lists in TIKA like I did. 
> Attached files are: > Updated test > Fixed WordExtractor > Added ListUtils -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [ANNOUNCE] Apache Tika 1.8 Released
Yay thanks Tyler! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: , "Timothy B." Reply-To: "dev@tika.apache.org" Date: Tuesday, April 21, 2015 at 8:34 AM To: "dev@tika.apache.org" Subject: RE: [ANNOUNCE] Apache Tika 1.8 Released >Thank you, Tyler! > >-Original Message- >From: Tyler Palsulich [mailto:tpalsul...@apache.org] >Sent: Monday, April 20, 2015 5:09 PM >To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org >Subject: [ANNOUNCE] Apache Tika 1.8 Released > >The Apache Tika project is pleased to announce the release of Apache Tika >1.8. The release >contents have been pushed out to the main Apache release site and to the >Maven Central sync, so the releases should be available as soon as the >mirrors get the syncs. > >Apache Tika is a toolkit for detecting and extracting metadata and >structured text content >from various documents using existing parser libraries. > >Apache Tika 1.8 contains a number of improvements and bug fixes. Details >can be found in the changes file: >http://www.apache.org/dist/tika/CHANGES-1.8.txt > >Apache Tika is available in source form from the following download page: >http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip > >Apache Tika is also available in binary form or for use using Maven 2 from >the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/ > >In the initial 48 hours, the release may not be available on all mirrors. 
>When downloading from a mirror site, please remember to verify the >downloads using signatures found on the Apache site: >https://people.apache.org/keys/group/tika.asc > >For more information on Apache Tika, visit the project home page: >http://tika.apache.org/ > >-- Tyler Palsulich, on behalf of the Apache Tika community
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505006#comment-14505006 ] Tim Allison commented on TIKA-1513: --- Y, I was concerned by that generally. Are you getting false positives with 0x03 specifically? I didn't find any in govdocs1, but I realize that corpus has limitations. Will add text/plain as supertype. Thank you! > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.9 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504996#comment-14504996 ] Luis Filipe Nassif commented on TIKA-1513: -- Hi Tim, I am ok with 1) and 2). But I think a one-byte magic can result in many false positives, especially with binary files. My current approach is detection by extension only. That needed a declaration of text/plain as a supertype. > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.9 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004 ] Moritz Dorka commented on TIKA-1315: Well, the original patch by Filip is essentially an 80% solution. Everything that I added is rather obscure functionality... > Basic list support in WordExtractor > --- > > Key: TIKA-1315 > URL: https://issues.apache.org/jira/browse/TIKA-1315 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Filip Bednárik >Priority: Minor > Fix For: 1.9 > > Attachments: ListManager.tar.bz2, ListNumbering.patch, > ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch > > > Hello guys, I am really sorry to post issue like this because I have no other > way of contacting you and I don't quite understand how you manage forks and > pull requests (I don't think you do that). Plus I don't know your coding > styles and stuff. > In my project I needed for tika to parse numbered lists from word .doc > documents, but TIKA doesn't support it. So I looked for solution and found > one here: > http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ > . So I adapted this solution to Apache TIKA with few fixes and improvements. > Anyway feel free to use any of it so it can help people who struggle with > lists in TIKA like I did. > Attached files are: > Updated test > Fixed WordExtractor > Added ListUtils -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505008#comment-14505008 ] Tim Allison commented on TIKA-1315: --- Ha. Ok, but your patch is really well done. Let me take a look at Filip's. I'll see if we can find someone on POI to add that call soon. Thank you! > Basic list support in WordExtractor > --- > > Key: TIKA-1315 > URL: https://issues.apache.org/jira/browse/TIKA-1315 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Filip Bednárik >Priority: Minor > Fix For: 1.9 > > Attachments: ListManager.tar.bz2, ListNumbering.patch, > ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch > > > Hello guys, I am really sorry to post issue like this because I have no other > way of contacting you and I don't quite understand how you manage forks and > pull requests (I don't think you do that). Plus I don't know your coding > styles and stuff. > In my project I needed for tika to parse numbered lists from word .doc > documents, but TIKA doesn't support it. So I looked for solution and found > one here: > http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ > . So I adapted this solution to Apache TIKA with few fixes and improvements. > Anyway feel free to use any of it so it can help people who struggle with > lists in TIKA like I did. > Attached files are: > Updated test > Fixed WordExtractor > Added ListUtils -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents
Tim Allison created TIKA-1611: - Summary: Allow RecursiveParserWrapper to catch exceptions from embedded documents Key: TIKA-1611 URL: https://issues.apache.org/jira/browse/TIKA-1611 Project: Tika Issue Type: Improvement Components: core Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.9 While parsing embedded documents, currently, if a parser hits an Exception, the parsing of the entire document comes to a grinding halt. For some applications, it might be better to catch the exception at the attachment level. The proposal would be to include the stack trace in the metadata object for that particular attachment. The user will be able to specify whether or not to catch embedded exceptions, and the default will be to catch embedded exceptions. This will be a small change to the legacy behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
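The catch-at-attachment behavior proposed in TIKA-1611 can be sketched outside of Tika. The following minimal Python illustration (all function and metadata key names here are hypothetical, not Tika's actual API) records a stack trace in the failing attachment's metadata instead of halting the whole parse, with a flag to restore the legacy fail-fast behavior:

```python
import traceback

def parse_all(attachments, parse, catch_embedded_exceptions=True):
    """Parse each attachment; on failure, record the stack trace in that
    attachment's metadata rather than aborting the whole document."""
    results = []
    for attachment in attachments:
        metadata = {}
        try:
            metadata["content"] = parse(attachment)
        except Exception:
            if not catch_embedded_exceptions:
                raise  # legacy behavior: one bad attachment halts everything
            # keep the stack trace with the attachment that caused it
            metadata["X-EXCEPTION:embedded"] = traceback.format_exc()
        results.append(metadata)
    return results
```

With the default `catch_embedded_exceptions=True`, a document with one corrupt attachment still yields metadata for every attachment, matching the default described in the proposal.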
[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504999#comment-14504999 ] Sergey Beryozkin commented on TIKA-1607: Hi, IMHO it indeed makes sense to keep the existing Metadata methods that return String values but also offer optional support for representing Metadata as a multivalued map of arbitrary object key/values where the original String to String[] pairs are converted into something more sophisticated if required... By the way, JAX-RS API has this interface: http://docs.oracle.com/javaee/7/api/javax/ws/rs/core/MultivaluedMap.html Not suggesting to use natively in Tika, but it might be of interest... Cheers, Sergey > Introduce new HashMap data structure for persistence of Tika > Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.9 > > > I am currently working on implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection of HashMaps, e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...)
> (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
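The phone-number use case above can be sketched as a toy multivalued map in Python (class, method, and key names here are hypothetical illustrations, not Tika's API): each key holds a list of values, a value may itself be structured, and a plain-string view is kept for backwards compatibility:

```python
class Metadata:
    """Toy multivalued metadata map: each key maps to a list of values,
    and a value may itself be structured (e.g. a dict per phone number)."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Append rather than overwrite, preserving multivalued semantics.
        self._data.setdefault(key, []).append(value)

    def get(self, key):
        # Backwards-compatible string view: first value rendered as a string.
        values = self._data.get(key)
        return str(values[0]) if values else None

    def get_values(self, key):
        # Structured view: the full list of (possibly nested) values.
        return self._data.get(key, [])

meta = Metadata()
meta.add("phonenumbers", {"number": "+162648743476",
                          "LibPN-CountryCode": "US",
                          "LibPN-NumberType": "International"})
meta.add("phonenumbers", {"number": "+1292611054",
                          "LibPN-CountryCode": "UK",
                          "LibPN-NumberType": "International"})
```

This mirrors the compatibility concern raised in the thread: existing callers keep getting strings, while new callers can ask for the structured values.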
[jira] [Resolved] (TIKA-1501) Fix the disabled Tika Bundle OSGi related unit tests
[ https://issues.apache.org/jira/browse/TIKA-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1501. --- Resolution: Fixed Fix Version/s: 1.9 r1675121. Thank you, [~bobpaulin]! > Fix the disabled Tika Bundle OSGi related unit tests > > > Key: TIKA-1501 > URL: https://issues.apache.org/jira/browse/TIKA-1501 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.6, 1.7 >Reporter: Nick Burch > Fix For: 1.9 > > Attachments: TIKA-1501-trunk.patch, TIKA-1501-trunkv2.patch, > TIKA-1501.patch > > > Currently, the unit tests for the Tika Bundle contain several bits like: > {code} > @Ignore // TODO Fix this test > {code} > We should really fix these unit tests so they work, and re-enable them -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504951#comment-14504951 ] Tim Allison commented on TIKA-1513: --- From govdocs1, it looks like a first byte of 0x03 is a safe way to identify these files. [This|http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml] was useful. Two mime type questions: 1) What should we use as the canonical mime type for .dbf files? Proposal: {{application/x-dbf}}. 2) What mimes should the parser "accept", or what should we include in the aliases? From [filext.com|http://filext.com/file-extension/DBF]: * application/dbase * application/x-dbase * application/dbf * application/x-dbf * zz-application/zz-winassoc-dbf First attempt at mime definition: {noformat} {noformat} > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.9 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
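Luis's false-positive concern can be made concrete with a small sketch (hypothetical code, not Tika's actual detector): a lone 0x03 first byte matches huge numbers of unrelated binary files, so one mitigation is to only claim the dbf type when the magic byte and the .dbf extension agree:

```python
def detect_dbf(first_bytes, filename):
    """Toy detector: a one-byte 0x03 magic is weak on its own, so require
    the .dbf extension to agree before claiming application/x-dbf."""
    has_magic = len(first_bytes) > 0 and first_bytes[0] == 0x03
    has_ext = filename.lower().endswith(".dbf")
    if has_magic and has_ext:
        return "application/x-dbf"
    # Fall back to the most generic type when the evidence conflicts.
    return "application/octet-stream"
```

A zip archive renamed to `.dbf` (first bytes `PK\x03\x04`) or a random binary starting with 0x03 would both fall through to the generic type under this rule.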
[jira] [Commented] (TIKA-1532) DIF Parser
[ https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504904#comment-14504904 ] Konstantin Gribov commented on TIKA-1532: - {{text/\*+xml}} is quite unusual type. OTOH, there's a lot of {{application/\*+xml}} and {{application/vnd.\*+xml}} types in IANA media types list (http://www.iana.org/assignments/media-types/media-types.xhtml) > DIF Parser > -- > > Key: TIKA-1532 > URL: https://issues.apache.org/jira/browse/TIKA-1532 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Aakarsh Medleri Hire Math > Labels: memex > > MIME Type detection & content parser for .dif format -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: [ANNOUNCE] Apache Tika 1.8 Released
Thank you, Tyler! -Original Message- From: Tyler Palsulich [mailto:tpalsul...@apache.org] Sent: Monday, April 20, 2015 5:09 PM To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org Subject: [ANNOUNCE] Apache Tika 1.8 Released The Apache Tika project is pleased to announce the release of Apache Tika 1.8. The release contents have been pushed out to the main Apache release site and to the Maven Central sync, so the releases should be available as soon as the mirrors get the syncs. Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 1.8 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/tika/CHANGES-1.8.txt Apache Tika is available in source form from the following download page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip Apache Tika is also available in binary form or for use with Maven 2 from the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/ In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: https://people.apache.org/keys/group/tika.asc For more information on Apache Tika, visit the project home page: http://tika.apache.org/ -- Tyler Palsulich, on behalf of the Apache Tika community
[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504884#comment-14504884 ] Tim Allison commented on TIKA-1295: --- [~lewismc], +1 to adding potential for hierarchical metadata on TIKA-1607. We should ensure during the transition (and maybe forever), that users can still get strings fairly easily. > Make some Dublin Core items multi-valued > > > Key: TIKA-1295 > URL: https://issues.apache.org/jira/browse/TIKA-1295 > Project: Tika > Issue Type: Bug > Components: metadata >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.9 > > > According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, > dc:title, dc:description and dc:rights should allow multiple values because > of language alternatives. Unless anyone objects in the next few days, I'll > switch those to Property.toInternalTextBag() from Property.toInternalText(). > I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504871#comment-14504871 ] Tim Allison commented on TIKA-1608: --- [~jeremybmerrill], thank you for raising this issue. If you go to "More", there's an "Attach Files" option. As I'm sure you've done, please only attach files that are ok to share with the public, and please let us know if the file is "granted" to Apache under ASF 2.0 so that we can use it in unit tests in the future. I'll take a look at our govdocs1/CommonCrawl exceptions and see if I can find a doc in there that matches your stack trace. From the stacktrace, it looks like the fix will have to be made at the POI level. I could be wrong, though! If you haven't done so already, please open a ticket on POI's [bugzilla|https://bz.apache.org/bugzilla/buglist.cgi?quicksearch=poi&list_id=123825] and add a hyperlink from there to here and vice versa so that we can track progress over here. Thank you, again. > RuntimeException on extracting text from Word 97-2004 Document > -- > > Key: TIKA-1608 > URL: https://issues.apache.org/jira/browse/TIKA-1608 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 >Reporter: Jeremy B.
Merrill > > Extracting text from the Word 97-2004 document located here > (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails > with the following stacktrace: > $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text > 1534-attachment.doc > Exception in thread "main" org.apache.tika.exception.TikaException: > Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@69af0db6 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at > org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171) > at > org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101) > at > org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49) > at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109) > at > org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > ... 5 more > I'm using trunk from Github, which I think is a flavor of 1.9. The document > opens properly in Word for Mac '11. > Happy to answer questions; I'm also on the "user" mailing list. If it's > relevant, I'm on java 1.7.0_55...
(Also let me know if there's a way to put > that document here in Jira rather than on my own dropbox.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Summary: CBOR Parser and detection [improvement] (was: CBOR Parser and detection improvement) > CBOR Parser and detection [improvement] > --- > > Key: TIKA-1610 > URL: https://issues.apache.org/jira/browse/TIKA-1610 > Project: Tika > Issue Type: New Feature > Components: detector, mime, parser >Affects Versions: 1.7 >Reporter: Luke sh >Priority: Trivial > Labels: memex > Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, > rfc_cbor.jpg > > > CBOR is a data format whose design goals include the possibility of extremely > small code size, fairly small message size, and extensibility without the > need for version negotiation (cited from http://cbor.io/ ). > It would be great if Tika is able to provide the support with CBOR parser and > identification. In the current project with Nutch, the Nutch > CommonCrawlDataDumper is used to dump the crawled segments to the files in > the format of CBOR. In order to read/parse those dumped files by this tool, > it would be great if tika is able to support parsing the cbor, the thing is > that the CommonCrawlDataDumper is not dumping with correct extension, it > dumps with its own rule, the default extension of the dumped file is html, so > it might be less painful if tika is able to detect and parse those files > without any pre-processing steps. > CommonCrawlDataDumper is calling the following to dump with cbor. > import com.fasterxml.jackson.dataformat.cbor.CBORFactory; > import com.fasterxml.jackson.dataformat.cbor.CBORGenerator; > fasterxml is a 3rd party library for converting json to .cbor and Vice Versa. 
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like > CBOR does not yet have its magic numbers to be detected/identified by other > applications (PFA: rfc_cbor.jpg) > It seems that the only way to inform other applications of the type as of now > is using the extension (i.e. .cbor), or probably content detection (i.e. byte > histogram distribution estimation). > There is another thing worth attention: it looks like tika has attempted > to add support for cbor mime detection in the tika-mimetypes.xml > (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the > cbor file dumped by CommonCrawlDataDumper. > According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a > self-describing Tag 55799 that seems to be used for cbor type > identification (the hex code might be 0xd9d9f7), but it is probably up to the > application to take care of this tag, and it is also possible that the > fasterxml library that the nutch dumping tool uses is missing this tag; an example cbor > file dumped by the Nutch tool i.e. CommonCrawlDataDumper has also been > attached (PFA: 142440269.html). > The following info is cited from the rfc, "...a decoder might be able to > parse both CBOR and JSON. >Such a decoder would need to mechanically distinguish the two >formats. An easy way for an encoder to help the decoder would be to >tag the entire CBOR item with tag 55799, the serialization of which >will never be found at the beginning of a JSON text..." > It looks like a file can have two parts/sections, i.e. the plain text > parts and the json prettified by cbor; this might also be worth attention > and consideration in parsing and type identification. > On the other hand, it is worth noting that the entries for cbor extension > detection need to be appended to the tika-mimetypes.xml too > e.g. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
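For reference, tag 55799 serializes as the three bytes 0xd9 0xd9 0xf7 (major type 6 with a two-byte argument, per RFC 7049 section 2.4.5), so the tag-based detection discussed in the issue might be sketched as below. This is only an illustration, and, as the report notes, files produced via jackson's CBOR generator may not carry the self-describe tag at all, so a negative result does not rule CBOR out:

```python
# Tag 55799 ("self-described CBOR") encoded per RFC 7049: 0xd9 0xd9 0xf7.
CBOR_SELF_DESCRIBE = b"\xd9\xd9\xf7"

def looks_like_cbor(data):
    """Return True when the stream starts with the self-describing CBOR tag.

    Encoders are not required to emit the tag, so False means 'unknown',
    not 'definitely not CBOR'."""
    return data[:3] == CBOR_SELF_DESCRIBE
```

Because the tag's serialization can never begin a valid JSON text, this check cleanly separates self-described CBOR from JSON, which is exactly the use case the RFC quote describes.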
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Description: CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/ ). It would be great if Tika is able to provide the support with CBOR parser and identification. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump the crawled segments to the files in the format of CBOR. In order to read/parse those dumped files by this tool, it would be great if tika is able to support parsing the cbor, the thing is that the CommonCrawlDataDumper is not dumping with correct extension, it dumps with its own rule, the default extension of the dumped file is html, so it might be less painful if tika is able to detect and parse those files without any pre-processing steps. CommonCrawlDataDumper is calling the following to dump with cbor. import com.fasterxml.jackson.dataformat.cbor.CBORFactory; import com.fasterxml.jackson.dataformat.cbor.CBORGenerator; fasterxml is a 3rd party library for converting json to .cbor and Vice Versa. According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like CBOR does not yet have its magic numbers to be detected/identified by other applications (PFA: rfc_cbor.jpg) It seems that the only way to inform other applications of the type as of now is using the extension (i.e. .cbor), or probably content detection (i.e. byte histogram distribution estimation). There is another thing worth the attention, it looks like tika has attempted to add the support with cbor mime detection in the tika-mimetypes.xml (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the cbor file dumped by CommonCrawlDataDumper. 
According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a self-describing Tag 55799 that seems to be used for cbor type identification(the hex code might be 0xd9d9f7), but it is probably up to the application that take care of this tag, and it is also possible that the fasterxml that the nutch dumping tool is missing this tag, an example cbor file dumped by the Nutch tool i.e. CommonCrawlDataDumper has also been attached (PFA: 142440269.html). The following info is cited from the rfc, "...a decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text..." It looks like the a file can have two parts/sections i.e. the plain text parts and the json prettified by cbor, this might be also worth the attention and consideration with the parsing and type identification. On the other hand, it is worth noting that the entries for cbor extension detection needs to be appended in the tika-mimetypes.xml too e.g. was: CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/ ). It would be great if Tika is able to provide the support with CBOR parser and identification. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump the crawled segments to the files in the format of CBOR. 
In order to read/parse those dumped files by this tool, it would be great if tika is able to support parsing the cbor, the thing is that the CommonCrawlDataDumper is not dumping with correct extension, it dumps with its own rule, the default extension of the dumped file is html, so it might be less painful if tika is able to detect and parse those files without any pre-processing steps. CommonCrawlDataDumper is calling the following to dump with cbor. import com.fasterxml.jackson.dataformat.cbor.CBORFactory; import com.fasterxml.jackson.dataformat.cbor.CBORGenerator; fasterxml is a 3rd party library for converting json to .cbor and Vice Versa. According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like CBOR does not yet have its magic numbers to be detected/identified by other applications (PFA: rfc_cbor.jpg) It seems that the only way to inform other applications of the type as of now is using the extension (i.e. .cbor), or probably content detection (i.e. byte histogram distribution estimation). There is another thing worth the attention, it looks like tika has attempted to add the support with cbor mime detection in the tika-mimetypes.xml (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the cbor file dumped by CommonCrawlDataDumper. According to http://tools.ietf.org/html/rfc7049#section-2.4.5, the