[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Attachment: cbor_tika.mimetypes.xml.jpg
rfc_cbor.jpg

 CBOR Parser and detection improvement
 -

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
  Labels: memex
 Attachments: cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika could provide CBOR parsing and identification. In 
 the current project with Nutch, the Nutch CommonCrawlDataDumper is used to 
 dump the crawled segments to files in CBOR format. The catch is that 
 CommonCrawlDataDumper does not dump with the correct extension: it follows 
 its own naming rule, and the default extension of a dumped file is .html. It 
 would therefore be much less painful if Tika could detect and parse those 
 files without any pre-processing steps. 
 CommonCrawlDataDumper calls the following to dump CBOR:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 FasterXML Jackson is a third-party library for converting JSON to CBOR and 
 vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not 
 yet have a registered magic number that other applications can use to 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems the only ways to inform other applications of the type at the 
 moment are the file extension (i.e. .cbor) or content-based detection (e.g. 
 byte-histogram distribution estimation).
 Another point worth attention: Tika has already attempted to add CBOR MIME 
 detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but this 
 detection does not work with the CBOR files dumped by CommonCrawlDataDumper. 
 According to http://tools.ietf.org/html/rfc7049, there is a self-describing 
 tag 55799 that can be used for CBOR type identification, but it is up to the 
 application to emit this tag, and it is possible that the FasterXML writer 
 used by the Nutch dumper omits it.
 On the other hand, it is worth noting that a glob entry for the .cbor 
 extension needs to be appended to tika-mimetypes.xml too, e.g.
 <glob pattern="*.cbor"/>
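Since no registered magic number exists, content detection would have to rely on the self-describing tag. A minimal sketch (plain Java, deliberately independent of Tika's detector API) of checking whether a byte buffer starts with the tag-55799 prefix 0xd9 0xd9 0xf7 described in the RFC:

```java
import java.util.Arrays;

public class CborSniffer {
    // RFC 7049: tag 55799 serializes as the three bytes 0xD9 0xD9 0xF7,
    // which can never appear at the start of a valid JSON text.
    private static final byte[] SELF_DESCRIBE =
            {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the buffer begins with the CBOR self-describe tag. */
    public static boolean hasSelfDescribeTag(byte[] prefix) {
        return prefix.length >= 3
                && Arrays.equals(Arrays.copyOf(prefix, 3), SELF_DESCRIBE);
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA1};
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(hasSelfDescribeTag(tagged)); // true
        System.out.println(hasSelfDescribeTag(json));   // false
    }
}
```

Note this only works for encoders that actually emit the tag; files written without it (apparently including the Nutch dumps) would still need extension globs or statistical detection.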



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



[jira] [Created] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)
Luke sh created TIKA-1610:
-

 Summary: CBOR Parser and detection improvement
 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial




[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Description: 
CBOR is a data format whose design goals include the possibility of extremely 
small code size, fairly small message size, and extensibility without the need 
for version negotiation (cited from http://cbor.io/ ).

It would be great if Tika could provide CBOR parsing and identification. In the 
current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump the 
crawled segments to files in CBOR format. The catch is that 
CommonCrawlDataDumper does not dump with the correct extension: it follows its 
own naming rule, and the default extension of a dumped file is .html. It would 
therefore be much less painful if Tika could detect and parse those files 
without any pre-processing steps. 

CommonCrawlDataDumper calls the following to dump CBOR:
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

FasterXML Jackson is a third-party library for converting JSON to CBOR and vice 
versa.

According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not yet 
have a registered magic number that other applications can use to 
detect/identify it (PFA: rfc_cbor.jpg).
It seems the only ways to inform other applications of the type at the moment 
are the file extension (i.e. .cbor) or content-based detection (e.g. 
byte-histogram distribution estimation).

Another point worth attention: Tika has already attempted to add CBOR MIME 
detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but this 
detection does not work with the CBOR files dumped by CommonCrawlDataDumper. 
According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
self-describing tag 55799 that can be used for CBOR type identification (it 
serializes as the three bytes 0xd9d9f7), but it is up to the application to 
emit this tag, and it is possible that the FasterXML writer used by the Nutch 
dumping tool omits it. An example CBOR file dumped by the Nutch tool, i.e. 
CommonCrawlDataDumper, has also been attached (PFA: 142440269.html).
The following is cited from the RFC: "...a decoder might be able to parse 
both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text..."
It also looks like a dumped file can contain two parts/sections, i.e. a plain 
text part and the CBOR-encoded JSON; this is also worth attention and 
consideration for parsing and type identification.

On the other hand, it is worth noting that a glob entry for the .cbor 
extension needs to be appended to tika-mimetypes.xml too, e.g.
<glob pattern="*.cbor"/>
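A sketch of what such an entry might look like in tika-mimetypes.xml, combining a magic match on the self-describe tag bytes with the extension glob. This is a hypothetical entry: the MIME type name and the exact match syntax (which follows the freedesktop.org shared-mime-info conventions Tika uses) should be checked against Tika's schema before use.

```xml
<!-- Hypothetical tika-mimetypes.xml entry for CBOR -->
<mime-type type="application/cbor">
  <!-- Tag 55799 serializes as the bytes 0xd9 0xd9 0xf7 (RFC 7049, 2.4.5) -->
  <magic priority="50">
    <match value="\xd9\xd9\xf7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```

The magic match only helps for encoders that emit the self-describe tag; the glob covers files that at least carry the .cbor extension.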






[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Attachment: 142440269.html

CBOR file dumped by the Nutch tool.



[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Summary: CBOR Parser and detection [improvement]  (was: CBOR Parser and 
detection improvement)

 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
 rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika is able to provide the support with CBOR parser and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to the files in 
 the format of CBOR. In order to read/parse those dumped files by this tool, 
 it would be great if tika is able to support parsing the cbor, the thing is 
 that the CommonCrawlDataDumper is not dumping with correct extension, it 
 dumps with its own rule, the default extension of the dumped file is html, so 
 it might be less painful if tika is able to detect and parse those files 
 without any pre-processing steps. 
 CommonCrawlDataDumper is calling the following to dump with cbor.
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd party library for converting json to .cbor and Vice Versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have its magic numbers to be detected/identified by other 
 applications (PFA: rfc_cbor.jpg)
 It seems that the only way to inform other applications of the type as of now 
 is using the extension (i.e. .cbor), or probably content detection (i.e. byte 
 histogram distribution estimation).  
 There is another thing worth the attention, it looks like tika has attempted 
 to add the support with cbor mime detection in the tika-mimetypes.xml 
 (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the 
 cbor file dumped by CommonCrawlDataDumper. 
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing Tag 55799 that seems to be used for cbor type 
 identification(the hex code might be 0xd9d9f7), but it is probably up to the 
 application that take care of this tag, and it is also possible that the 
 fasterxml that the nutch dumping tool is missing this tag, an example cbor 
 file dumped by the Nutch tool i.e. CommonCrawlDataDumper has also been 
 attached (PFA: 142440269.html).
 The following info is cited from the rfc, ...a decoder might be able to 
 parse both CBOR and JSON.
Such a decoder would need to mechanically distinguish the two
formats.  An easy way for an encoder to help the decoder would be to
tag the entire CBOR item with tag 55799, the serialization of which
will never be found at the beginning of a JSON text...
 It looks like a file can have two parts/sections, i.e. a plain-text part and 
 the JSON serialized as CBOR; this might also be worth attention and 
 consideration for parsing and type identification.
 On the other hand, it is worth noting that an entry for .cbor extension 
 detection needs to be appended to tika-mimetypes.xml too, 
 e.g.
 <glob pattern="*.cbor"/>
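 Such an entry could pair the glob with a magic match on the self-describe
 tag. The sketch below follows the general shape of tika-mimetypes.xml
 entries; the priority value and exact match syntax are assumptions to verify
 against Tika's schema:

```xml
<mime-type type="application/cbor">
  <!-- RFC 7049 self-describe tag 55799 = bytes 0xD9 0xD9 0xF7 -->
  <magic priority="50">
    <match value="0xd9d9f7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```

 Magic-based detection only helps for files that actually carry the tag; the 
 extension glob remains the fallback for untagged output.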



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1611.
---
Resolution: Fixed

r1675159.

Nothing like testing to see behavior, rather than assumptions. :(

 Allow RecursiveParserWrapper to catch exceptions from embedded documents
 

 Key: TIKA-1611
 URL: https://issues.apache.org/jira/browse/TIKA-1611
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


 While parsing embedded documents, currently, if a parser hits an 
 EncryptedDocumentException or anything wrapped in a TikaException, the 
 Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
 {noformat}
 DELEGATING_PARSER.parse(
 newStream,
 new EmbeddedContentHandler(new 
 BodyContentHandler(handler)),
 metadata, context);
 } catch (EncryptedDocumentException ede) {
 // TODO: can we log a warning that we lack the password?
 // For now, just skip the content
 } catch (TikaException e) {
 // TODO: can we log a warning somehow?
 // Could not parse the entry, just skip the content
 } finally {
 tmp.close();
 }
 {noformat}
 For some applications, it might be better to store the stack trace of the 
 attachment that caused an exception.
 The proposal would be to include the stack trace in the metadata object for 
 that particular attachment.
 The user will be able to specify whether or not to store stack traces, and 
 the default will be to store stack traces.  This will be a small change to 
 the legacy behavior.
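 The gist of the proposal (store the stack trace per attachment rather than 
 discarding it) can be sketched with plain Java; the metadata key name below 
 is illustrative, not an actual Tika constant:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class EmbeddedExceptionDemo {
    // Illustrative key; Tika's real constant may differ.
    static final String EMBEDDED_EXCEPTION = "X-TIKA:EXCEPTION:embedded_exception";

    /** Renders a throwable's stack trace as a single string. */
    public static String stackTraceToString(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /** Instead of swallowing the exception, record it in the attachment's metadata. */
    public static void recordException(Map<String, String> metadata, Throwable t) {
        metadata.put(EMBEDDED_EXCEPTION, stackTraceToString(t));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        try {
            throw new RuntimeException("bad embedded stream");
        } catch (RuntimeException e) {
            recordException(metadata, e);
        }
        // The trace survives in the metadata instead of being lost.
        System.out.println(
                metadata.get(EMBEDDED_EXCEPTION).contains("bad embedded stream")); // true
    }
}
```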





[jira] [Updated] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1611:
--
Description: 
While parsing embedded documents, currently, if a parser hits an 
EncryptedDocumentException or anything wrapped in a TikaException, the 
Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
{noformat}
DELEGATING_PARSER.parse(
newStream,
new EmbeddedContentHandler(new 
BodyContentHandler(handler)),
metadata, context);
} catch (EncryptedDocumentException ede) {
// TODO: can we log a warning that we lack the password?
// For now, just skip the content
} catch (TikaException e) {
// TODO: can we log a warning somehow?
// Could not parse the entry, just skip the content
} finally {
tmp.close();
}
{noformat}


For some applications, it might be better to store the stack trace of the 
attachment that caused an exception.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to store stack traces, and the 
default will be to store stack traces.  This will be a small change to the 
legacy behavior.

  was:
While parsing embedded documents, currently, if a parser hits an Exception, the 
Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
{noformat}
DELEGATING_PARSER.parse(
newStream,
new EmbeddedContentHandler(new 
BodyContentHandler(handler)),
metadata, context);
} catch (EncryptedDocumentException ede) {
// TODO: can we log a warning that we lack the password?
// For now, just skip the content
} catch (TikaException e) {
// TODO: can we log a warning somehow?
// Could not parse the entry, just skip the content
} finally {
tmp.close();
}
{noformat}


For some applications, it might be better to store the stack trace of the 
attachment that caused an exception.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to store stack traces, and the 
default will be to store stack traces.  This will be a small change to the 
legacy behavior.


 Allow RecursiveParserWrapper to catch exceptions from embedded documents
 

 Key: TIKA-1611
 URL: https://issues.apache.org/jira/browse/TIKA-1611
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


 While parsing embedded documents, currently, if a parser hits an 
 EncryptedDocumentException or anything wrapped in a TikaException, the 
 Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
 {noformat}
 DELEGATING_PARSER.parse(
 newStream,
 new EmbeddedContentHandler(new 
 BodyContentHandler(handler)),
 metadata, context);
 } catch (EncryptedDocumentException ede) {
 // TODO: can we log a warning that we lack the password?
 // For now, just skip the content
 } catch (TikaException e) {
 // TODO: can we log a warning somehow?
 // Could not parse the entry, just skip the content
 } finally {
 tmp.close();
 }
 {noformat}
 For some applications, it might be better to store the stack trace of the 
 attachment that caused an exception.
 The proposal would be to include the stack trace in the metadata object for 
 that particular attachment.
 The user will be able to specify whether or not to store stack traces, and 
 the default will be to store stack traces.  This will be a small change to 
 the legacy behavior.





[jira] [Commented] (TIKA-1612) Exceptions getting image data in PPT files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505335#comment-14505335
 ] 

Tim Allison commented on TIKA-1612:
---

Not sure how we want to fix this.  To make this parallel to our handling of 
other embedded files, we'd just swallow the exception...I really don't like 
that option.

Recommendations?

 Exceptions getting image data in PPT files
 --

 Key: TIKA-1612
 URL: https://issues.apache.org/jira/browse/TIKA-1612
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Priority: Minor

 In numerous (~500) ppt files in govdocs1, we're getting zip exceptions 
 (unknown compression method, bad block, etc) when Tika's HSLFExtractor calls 
 {{getData()}} on an embedded image.
 Under normal circumstances (I just learned today...), if an attachment causes 
 a RuntimeException, we are currently swallowing that in 
 {{ParsingEmbeddedDocumentExtractor}}.
 However, because we're calling {{getData()}} before the embedded extractor 
 takes over, if there is an exception there, the parse of the entire file 
 fails.





[jira] [Commented] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505358#comment-14505358
 ] 

Hudson commented on TIKA-1611:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #639 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/639/])
TIKA-1611 -- allow RecursiveParserWrapper to catch exceptions caused by 
embedded documents (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1675159)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
* /tika/trunk/tika-batch/src/test/java/org/apache/tika/util
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/utils/ExceptionUtils.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/test_recursive_embedded_npe.docx
* 
/tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java


 Allow RecursiveParserWrapper to catch exceptions from embedded documents
 

 Key: TIKA-1611
 URL: https://issues.apache.org/jira/browse/TIKA-1611
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


 While parsing embedded documents, currently, if a parser hits an 
 EncryptedDocumentException or anything wrapped in a TikaException, the 
 Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
 {noformat}
 DELEGATING_PARSER.parse(
 newStream,
 new EmbeddedContentHandler(new 
 BodyContentHandler(handler)),
 metadata, context);
 } catch (EncryptedDocumentException ede) {
 // TODO: can we log a warning that we lack the password?
 // For now, just skip the content
 } catch (TikaException e) {
 // TODO: can we log a warning somehow?
 // Could not parse the entry, just skip the content
 } finally {
 tmp.close();
 }
 {noformat}
 For some applications, it might be better to store the stack trace of the 
 attachment that caused an exception.
 The proposal would be to include the stack trace in the metadata object for 
 that particular attachment.
 The user will be able to specify whether or not to store stack traces, and 
 the default will be to store stack traces.  This will be a small change to 
 the legacy behavior.





[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505368#comment-14505368
 ] 

Luis Filipe Nassif commented on TIKA-879:
-

Yes, thank you very much for testing with govdocs1 ([~gagravarr]'s suggestion)!

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}}, the mime type for {{.eml}} files is different 
 (you can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really works 
 for such files, even if you set {{CONTENT_TYPE}} in metadata or an {{.eml}} 
 file name in {{RESOURCE_NAME_KEY}}.
 As I found {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.






[jira] [Updated] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1611:
--
Description: 
While parsing embedded documents, currently, if a parser hits an Exception, the 
Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
{noformat}
DELEGATING_PARSER.parse(
newStream,
new EmbeddedContentHandler(new 
BodyContentHandler(handler)),
metadata, context);
} catch (EncryptedDocumentException ede) {
// TODO: can we log a warning that we lack the password?
// For now, just skip the content
} catch (TikaException e) {
// TODO: can we log a warning somehow?
// Could not parse the entry, just skip the content
} finally {
tmp.close();
}
{noformat}


For some applications, it might be better to store the stack trace of the 
attachment that caused an exception.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to store stack traces, and the 
default will be to store stack traces.  This will be a small change to the 
legacy behavior.

  was:
While parsing embedded documents, currently, if a parser hits an Exception, the 
parsing of the entire document comes to a grinding halt.  For some 
applications, it might be better to catch the exception at the attachment level.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to catch embedded exceptions, 
and the default will be to catch embedded exceptions.  This will be a small 
change to the legacy behavior.


 Allow RecursiveParserWrapper to catch exceptions from embedded documents
 

 Key: TIKA-1611
 URL: https://issues.apache.org/jira/browse/TIKA-1611
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


 While parsing embedded documents, currently, if a parser hits an Exception, 
 the Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
 {noformat}
 DELEGATING_PARSER.parse(
 newStream,
 new EmbeddedContentHandler(new 
 BodyContentHandler(handler)),
 metadata, context);
 } catch (EncryptedDocumentException ede) {
 // TODO: can we log a warning that we lack the password?
 // For now, just skip the content
 } catch (TikaException e) {
 // TODO: can we log a warning somehow?
 // Could not parse the entry, just skip the content
 } finally {
 tmp.close();
 }
 {noformat}
 For some applications, it might be better to store the stack trace of the 
 attachment that caused an exception.
 The proposal would be to include the stack trace in the metadata object for 
 that particular attachment.
 The user will be able to specify whether or not to store stack traces, and 
 the default will be to store stack traces.  This will be a small change to 
 the legacy behavior.





NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Hi Folks,
Whilst addressing NUTCH-1994, I've experienced a dependency problem
(related to unpublished artifacts on Maven Central) which I am working
through right now.
When making the upgrade in Nutch, I get the following:

[ivy:resolve]   -- artifact edu.ucar#udunits;4.5.5!udunits.jar:
[ivy:resolve]
http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar
[ivy:resolve] ::
[ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::
[ivy:resolve] :: edu.ucar#jj2000;5.2: not found
[ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
[ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED
/usr/local/trunk_clean/build.xml:112: The following error occurred while
executing this line:
/usr/local/trunk_clean/src/plugin/build.xml:60: The following error
occurred while executing this line:
/usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to
resolve dependencies:
resolve failed - see output for details

Total time: 17 seconds

I've just pushed the edu.ucar#udunits;4.5.5 artifacts, so they will be
available imminently. The remaining artifact, edu.ucar#jj2000;5.2, has a
corrupted POM, which means that OSS Nexus will not accept it. I'll send a pull
request further upstream for that ASAP.

Finally, the bzip2 dependency is a third-party dependency from another
organization, licensed under the MIT license. I will register interest to
publish that dependency and push it; then we will be good to go.

Lewis



-- 
*Lewis*


Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Oh, Ji-Hyun (329F-Affiliate)
Hi Tika friends,

I am currently engaged in a project funded by the National Science Foundation. 
Our goal is to develop a research-friendly environment where geoscientists, like 
me, can easily find the source code they need. According to a survey, scientists 
spend a considerable amount of their time processing data instead of doing 
actual science. Based on my experience as a climate scientist, there is a set of 
frequently used analysis tools in atmospheric science, so it could be helpful if 
these tools were easily shared among scientists. The catch is that the tools are 
written in various scientific languages, so we are trying to provide metadata 
for source code stored in public repositories to help scientists select source 
code for their own purposes.

As a first step, I listed the file formats that are widely used in climate 
science.

FORTRAN (.f, .f90, f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System)
(.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I 
used Tika to obtain the content type of files with suffixes .f, .f90, and .m, 
Tika detected these files as text/plain:

ohjihyun% tika -m spctime.f

Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f

ohjihyun% tika -m wavelet.m
Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m

I verified that Tika gives the correct content type (text/x-java-source) for a 
Java file:
ohjihyun% tika -m UrlParser.java
Content-Encoding: ISO-8859-1
Content-Length: 2178
Content-Type: text/x-java-source
LoC: 70
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content-type, as 
Java has SourceCodeParser?
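For glob-based detection alone, a custom mime-type definition (loaded through 
Tika's custom-mimetypes.xml mechanism) might already be enough, without writing 
a parser. The type names in this sketch follow common x- conventions and are 
assumptions, not Tika's canonical names:

```xml
<mime-info>
  <mime-type type="text/x-fortran">
    <sub-class-of type="text/plain"/>
    <glob pattern="*.f"/>
    <glob pattern="*.f77"/>
    <glob pattern="*.f90"/>
  </mime-type>
  <mime-type type="text/x-matlab">
    <sub-class-of type="text/plain"/>
    <glob pattern="*.m"/>
  </mime-type>
  <mime-type type="text/x-ncl">
    <sub-class-of type="text/plain"/>
    <glob pattern="*.ncl"/>
  </mime-type>
</mime-info>
```

A dedicated parser per language would only be needed for richer metadata (such 
as the LoC field the SourceCodeParser emits), not for type identification.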
Thank you in advance for your insightful comments.

Ji-Hyun


[jira] [Commented] (TIKA-1601) Integrate Jackcess to handle MSAccess files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505377#comment-14505377
 ] 

Luis Filipe Nassif commented on TIKA-1601:
--

Great! Give me 3 more days to submit the patch. Do you have an Apache 
2.0-licensed MDB file for unit tests?

 Integrate Jackcess to handle MSAccess files
 ---

 Key: TIKA-1601
 URL: https://issues.apache.org/jira/browse/TIKA-1601
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison

 Recently, James Ahlborn, the current maintainer of 
 [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense 
 Jackcess to Apache 2.0.  [~boneill], the CTO at [Health Market Science, a 
 LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with 
 this relicensing and led the charge to obtain all necessary corporate 
 approval to deliver a 
 [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to 
 Apache.  As anyone who has tried to get corporate approval for anything 
 knows, this can sometimes require not a small bit of effort.
 If I may speak on behalf of Tika and the larger Apache community, I offer a 
 sincere thanks to James, Brian and the other developers and contributors to 
 Jackcess!!!
 Once the licensing info has been changed in Jackcess and the new release is 
 available in maven, we can integrate Jackcess into Tika and add a capability 
 to process MSAccess.
 As a side note, I reached out to the developers and contributors to determine 
 if there were any objections.  I couldn't find addresses for everyone, and 
 not everyone replied, but those who did offered their support to this move. 





[jira] [Created] (TIKA-1612) Exceptions getting image data in PPT files

2015-04-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1612:
-

 Summary: Exceptions getting image data in PPT files
 Key: TIKA-1612
 URL: https://issues.apache.org/jira/browse/TIKA-1612
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Priority: Minor


In numerous (~500) ppt files in govdocs1, we're getting zip exceptions (unknown 
compression method, bad block, etc) when Tika's HSLFExtractor calls 
{{getData()}} on an embedded image.

Under normal circumstances (I just learned today...), if an attachment causes a 
RuntimeException, we are currently swallowing that in 
{{ParsingEmbeddedDocumentExtractor}}.

However, because we're calling {{getData()}} before the embedded extractor 
takes over, if there is an exception there, the parse of the entire file fails.





[jira] [Commented] (TIKA-1532) DIF Parser

2015-04-21 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504904#comment-14504904
 ] 

Konstantin Gribov commented on TIKA-1532:
-

{{text/\*+xml}} is quite unusual type. OTOH, there's a lot of 
{{application/\*+xml}} and {{application/vnd.\*+xml}} types in IANA media types 
list (http://www.iana.org/assignments/media-types/media-types.xhtml)

 DIF Parser
 --

 Key: TIKA-1532
 URL: https://issues.apache.org/jira/browse/TIKA-1532
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Aakarsh Medleri Hire Math
  Labels: memex

 MIME Type detection  content parser for .dif format





[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505057#comment-14505057
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

No, I did not give a try to 0x03. How many files are detected as octet-stream 
in govdocs1? I wouldn't like to hit an issue similar to TIKA-1554 again (I am 
indexing ALL desktop files). I will test 0x03 and report the results here. Can 
we at least decrease the magic priority to 10 or 20 for now?

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?





[jira] [Commented] (TIKA-1501) Fix the disabled Tika Bundle OSGi related unit tests

2015-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505051#comment-14505051
 ] 

Hudson commented on TIKA-1501:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #638 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/638/])
TIKA-1501: Fix disabled OSGi related unit tests. Fixes from Bob Paulin. 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1675121)
* /tika/trunk/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java


 Fix the disabled Tika Bundle OSGi related unit tests
 

 Key: TIKA-1501
 URL: https://issues.apache.org/jira/browse/TIKA-1501
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.6, 1.7
Reporter: Nick Burch
 Fix For: 1.9

 Attachments: TIKA-1501-trunk.patch, TIKA-1501-trunkv2.patch, 
 TIKA-1501.patch


 Currently, the unit tests for the Tika Bundle contain several bits like:
 {code}
 @Ignore // TODO Fix this test
 {code}
 We should really fix these unit tests so they work, and re-enable them





[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-04-21 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-1607:
---
Summary: Introduce new arbitrary object key/values data structure for 
persistence of Tika Metadata  (was: Introduce new HashMap<String, Object> data 
structure for persistence of Tika Metadata)

 Introduce new arbitrary object key/values data structure for persistence of 
 Tika Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.9


 I am currently working implementing more comprehensive extraction and 
 enhancement of the Tika support for Phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  Object
 {code}
 Where Object could be a Collection<HashMap<String, Property>>, 
 HashMap<String, Property>, String/Int/Long, e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
 (etc)] 
 {code}
 There are obvious backwards-compatibility issues with this approach; 
 additionally, it is a fundamental change to the core Metadata API. I hope, 
 however, that the <String, Object> mapping is flexible enough to allow me to 
 model Tika Metadata the way I want.
 Any comments folks? Thanks
 Lewis
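 The proposed String-to-Object mapping can be sketched with plain Java 
 collections; the key and attribute names mirror the hypothetical phone-number 
 example above and are not an existing Tika API:

```java
import java.util.HashMap;
import java.util.Map;

public class ObjectMetadataDemo {

    /** Builds an Object-valued metadata map along the lines of the proposal. */
    public static Map<String, Object> buildPhoneMetadata() {
        Map<String, String> us = new HashMap<>();
        us.put("LibPN-CountryCode", "US");
        us.put("LibPN-NumberType", "International");

        Map<String, String> uk = new HashMap<>();
        uk.put("LibPN-CountryCode", "UK");
        uk.put("LibPN-NumberType", "International");

        // Each phone number maps to its own nested attribute map.
        Map<String, Map<String, String>> numbers = new HashMap<>();
        numbers.put("+162648743476", us);
        numbers.put("+1292611054", uk);

        Map<String, Object> metadata = new HashMap<>();
        metadata.put("phonenumbers", numbers);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = buildPhoneMetadata();
        // Consumers downcast based on the key's documented value type.
        @SuppressWarnings("unchecked")
        Map<String, Map<String, String>> numbers =
                (Map<String, Map<String, String>>) metadata.get("phonenumbers");
        System.out.println(numbers.get("+162648743476").get("LibPN-CountryCode")); // US
    }
}
```

 The backwards-compatibility concern is visible in the cast: callers of the 
 current String[]-based API would need to start inspecting value types.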





[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1608:

Attachment: 1534-attachment.doc

document failing under this bug

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc


 Extracting text from the Word 97-2004 document located here 
 (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
 with the following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
 1534-attachment.doc 
 Exception in thread main org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@69af0db6
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(Native Method)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
   at 
 org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
   at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   ... 5 more
 I'm using trunk from Github, which I think is a flavor of 1.9. The document 
 opens properly in Word for Mac '11.
 Happy to answer questions; I'm also on the user mailing list. If it's 
 relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
 that document here in Jira rather than on my own dropbox.)





[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505102#comment-14505102
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

POI bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc


 Extracting text from the Word 97-2004 document attached here fails with the 
 following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
 1534-attachment.doc 
 Exception in thread main org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@69af0db6
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(Native Method)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
   at 
 org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
   at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   ... 5 more
 I'm using trunk from GitHub, which I think is a flavor of 1.9. The document 
 opens properly in Word for Mac '11.
 Happy to answer questions; I'm also on the user mailing list. If it's 
 relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
 that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505008#comment-14505008
 ] 

Tim Allison commented on TIKA-1315:
---

Ha.  Ok, but your patch is really well done.  Let me take a look at Filip's.  
I'll see if we can find someone on POI to add that call soon.  Thank you!

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch


 Hello guys, I am really sorry to post an issue like this, because I have no 
 other way of contacting you and I don't quite understand how you manage forks 
 and pull requests (I don't think you do that). Plus I don't know your coding 
 styles and stuff.
 In my project I needed Tika to parse numbered lists from Word .doc 
 documents, but Tika doesn't support it. So I looked for a solution and found 
 one here: 
 http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
 So I adapted this solution to Apache Tika with a few fixes and improvements. 
 Anyway, feel free to use any of it so it can help people who struggle with 
 lists in Tika like I did.
 Attached files are:
 Updated test
 Fixed WordExtractor
 Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504996#comment-14504996
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

Hi Tim,

I am ok with 1) and 2). But I think a one-byte magic can result in many false 
positives, especially with binary files. My current approach is detection by 
extension only, which needed a declaration of text/plain as a supertype.

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata

2015-04-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505054#comment-14505054
 ] 

Ray Gauss II commented on TIKA-1607:


We've had a few discussions on structured metadata over the years, some of 
which were captured in the [MetadataRoadmap Wiki 
page|http://wiki.apache.org/tika/MetadataRoadmap].

I'd agree that we should strive to maintain backwards compatibility for simple 
values.

I think we should also consider serialization of the metadata store, not just 
in the {{Serializable}} interface sense, but perhaps being able to easily 
marshal the entire metadata store into JSON and XML.

As [~gagravarr] points out, work has been done to express structured metadata 
via the existing metadata store.  In that email thread you'll find reference to 
the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].

 Introduce new HashMap<String, Object> data structure for persistence of Tika 
 Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.9


 I am currently working on implementing more comprehensive extraction and 
 enhancement of the Tika support for phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  Object
 {code}
 Where Object could be a Collection<HashMap<String, Property>>, 
 HashMap<String, Property>, String/Int/Long e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach... 
 additionally it is a fundamental change to the core Metadata API. I hope that 
 the String, Object Mapping however is flexible enough to allow me to model 
 Tika Metadata the way I want.
 Any comments folks? Thanks
 Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-1554) Improve EMF file detection

2015-04-21 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif closed TIKA-1554.

   Resolution: Fixed
Fix Version/s: 1.8

Resolved in r4608ff5. Thanks.

 Improve EMF file detection
 --

 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: nonEmf.dat


 I am getting many files being incorrectly detected as application/x-emf. I 
 think the current magic is too common. According to MS documentation 
 (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
 https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
 to:
 {code}
 <mime-type type="application/x-emf">
   <acronym>EMF</acronym>
   <_comment>Extended Metafile</_comment>
   <glob pattern="*.emf"/>
   <magic priority="50">
     <match value="0x0100" type="string" offset="0">
       <match value=" EMF" type="string" offset="40"/>
     </match>
   </magic>
 </mime-type>
 {code}
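The proposed two-part magic can be exercised outside of Tika with a plain byte check. A minimal sketch (illustrative code, not Tika's detector; class and method names are made up) mirroring the magic above: the header type bytes 01 00 at offset 0 plus the " EMF" signature at offset 40, which MS-EMF places in the EMR_HEADER record.

```java
// Illustrative sketch (not Tika code) of the tightened EMF check: the
// EMR_HEADER record type starts with bytes 01 00 at offset 0, and per
// MS-EMF the " EMF" signature (bytes 20 45 4D 46) sits at byte offset 40.
public class EmfMagic {
    public static boolean looksLikeEmf(byte[] b) {
        if (b == null || b.length < 44) {
            return false;
        }
        boolean headerType = b[0] == 0x01 && b[1] == 0x00;  // the "0x0100" match at offset 0
        boolean signature = b[40] == (byte) ' ' && b[41] == (byte) 'E'
                && b[42] == (byte) 'M' && b[43] == (byte) 'F';  // " EMF" at offset 40
        return headerType && signature;
    }

    public static void main(String[] args) {
        byte[] emfLike = new byte[44];
        emfLike[0] = 0x01;
        emfLike[40] = (byte) ' ';
        emfLike[41] = (byte) 'E';
        emfLike[42] = (byte) 'M';
        emfLike[43] = (byte) 'F';
        System.out.println(looksLikeEmf(emfLike));       // passes both checks
        System.out.println(looksLikeEmf(new byte[44]));  // rejected: no signature at offset 40
    }
}
```

Requiring both matches is what cuts the false positives: a file that merely starts with 01 00 no longer qualifies.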



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505093#comment-14505093
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

Hi Tim,

I added the document. I'm totally cool with the document being viewed by the 
public. I can't really grant it to the ASF since I didn't create it. It's an 
attachment from an email in an email dump (http://jebemail.com) posted by 
former Florida governor Jeb Bush. So whether it's usable is probably a question 
for the ASF's lawyers. 

But for the avoidance of doubt, I grant any rights that I might have in the 
document to the ASF.

I'll open a POI bug.

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc


 Extracting text from the Word 97-2004 document located here 
 (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
 with the following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
 1534-attachment.doc 
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@69af0db6
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(Native Method)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
   at 
 org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
   at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   ... 5 more
 I'm using trunk from GitHub, which I think is a flavor of 1.9. The document 
 opens properly in Word for Mac '11.
 Happy to answer questions; I'm also on the user mailing list. If it's 
 relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
 that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505092#comment-14505092
 ] 

Tim Allison commented on TIKA-1513:
---

Completely agree.  

Only 2,386 files.

This is the table of the file extensions for files identified as 
application/octet-stream.

||File Extension||Count||
|dbase3|1664|
|wp|362|
|unk|   285|
|gls|   60|
|ileaf| 4|
|sys|   3|
|chp|   2|
|lnk|   2|
|mac|   2|
|squeak|1|
|bin|   1|

Would very much appreciate hearing what you find, and yes, we can certainly 
decrease the priority...I had my priorities backwards.  Sorry.

Obviously, if you find false positives, we'll back off to the file suffix.  I, 
too, was less than enthusiastic about a single-byte mime identifier.

Thank you!

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1608:

Description: 
Extracting text from the Word 97-2004 document attached here fails with the 
following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
1534-attachment.doc 
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
at 
org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
at 
org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
at 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
... 5 more

I'm using trunk from GitHub, which I think is a flavor of 1.9. The document 
opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's 
relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
that document here in Jira rather than on my own dropbox.)


  was:
Extracting text from the Word 97-2004 document located here 
(https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with 
the following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
1534-attachment.doc 
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
at 
org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
at 
org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
at 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
... 5 more

I'm using trunk from GitHub, which I think is a flavor of 1.9. The document 
opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's 
relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
that document here in Jira rather than on my own dropbox.)



 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc


 Extracting text from the Word 97-2004 document attached here fails with the 
 following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
 1534-attachment.doc 
 Exception in thread 

[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata

2015-04-21 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504999#comment-14504999
 ] 

Sergey Beryozkin commented on TIKA-1607:


Hi, 
IMHO it indeed makes sense to keep the existing Metadata methods that return 
String values, but also offer optional support for representing Metadata as a 
multivalued map of arbitrary object keys/values, where the original String to 
String[] pairs are converted into something more sophisticated if required...

By the way, JAX-RS API has this interface:
http://docs.oracle.com/javaee/7/api/javax/ws/rs/core/MultivaluedMap.html

Not suggesting to use natively in Tika, but it might be of interest...
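One way such a multivalued String-to-Object store could look, using only plain java.util types (an illustrative sketch, not the Tika Metadata API; class and method names are made up), is to keep a legacy String accessor alongside structured values:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not the Tika Metadata API) of a multivalued
// String -> Object store: structured values per key, plus a
// backwards-compatible String accessor for simple values.
public class MultivaluedMetadata {
    private final Map<String, List<Object>> store = new LinkedHashMap<>();

    public void add(String name, Object value) {
        store.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Legacy-style accessor: first value, rendered as a String.
    public String get(String name) {
        List<Object> values = store.get(name);
        return (values == null || values.isEmpty()) ? null : String.valueOf(values.get(0));
    }

    public List<Object> getValues(String name) {
        return store.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultivaluedMetadata md = new MultivaluedMetadata();
        // Structured value: one map per phone number, as in the proposal above.
        Map<String, String> number = new LinkedHashMap<>();
        number.put("number", "+162648743476");
        number.put("LibPN-CountryCode", "US");
        number.put("LibPN-NumberType", "International");
        md.add("phonenumbers", number);
        md.add("title", "report.doc");
        System.out.println(md.get("title"));  // simple values still read as Strings
        System.out.println(md.getValues("phonenumbers").size());
    }
}
```

The point of the sketch is the compatibility story: simple String values round-trip through `get` unchanged, while structured consumers use `getValues`.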

Cheers, Sergey



 Introduce new HashMap<String, Object> data structure for persistence of Tika 
 Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.9


 I am currently working on implementing more comprehensive extraction and 
 enhancement of the Tika support for phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  Object
 {code}
 Where Object could be a Collection<HashMap<String, Property>>, 
 HashMap<String, Property>, String/Int/Long e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach... 
 additionally it is a fundamental change to the core Metadata API. I hope that 
 the String, Object Mapping however is flexible enough to allow me to model 
 Tika Metadata the way I want.
 Any comments folks? Thanks
 Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1501) Fix the disabled Tika Bundle OSGi related unit tests

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1501.
---
   Resolution: Fixed
Fix Version/s: 1.9

r1675121.

Thank you, [~bobpaulin]!

 Fix the disabled Tika Bundle OSGi related unit tests
 

 Key: TIKA-1501
 URL: https://issues.apache.org/jira/browse/TIKA-1501
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.6, 1.7
Reporter: Nick Burch
 Fix For: 1.9

 Attachments: TIKA-1501-trunk.patch, TIKA-1501-trunkv2.patch, 
 TIKA-1501.patch


 Currently, the unit tests for the Tika Bundle contain several bits like:
 {code}
 @Ignore // TODO Fix this test
 {code}
 We should really fix these unit tests so they work, and re-enable them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1611:
-

 Summary: Allow RecursiveParserWrapper to catch exceptions from 
embedded documents
 Key: TIKA-1611
 URL: https://issues.apache.org/jira/browse/TIKA-1611
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


Currently, while parsing embedded documents, if a parser hits an Exception, 
parsing of the entire document comes to a grinding halt.  For some 
applications, it might be better to catch the exception at the attachment level.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to catch embedded exceptions, 
and the default will be to catch embedded exceptions.  This will be a small 
change to the legacy behavior.
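The proposed behavior can be sketched with plain Java (illustrative code, not the actual RecursiveParserWrapper; the metadata key name is made up): when one attachment's parser throws, record the stack trace in that attachment's metadata instead of aborting the whole parse.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the actual RecursiveParserWrapper) of catching
// an exception at the attachment level and recording the stack trace in
// that attachment's metadata. The "embedded-exception" key is hypothetical.
public class EmbeddedExceptionCatcher {
    public static Map<String, String> parseAttachment(Runnable attachmentParser,
                                                      boolean catchEmbeddedExceptions) {
        Map<String, String> metadata = new HashMap<>();
        try {
            attachmentParser.run();
        } catch (RuntimeException e) {
            if (!catchEmbeddedExceptions) {
                throw e;  // legacy behavior: the whole parse fails
            }
            // Proposed behavior: stash the stack trace in the attachment's metadata.
            StringWriter trace = new StringWriter();
            e.printStackTrace(new PrintWriter(trace));
            metadata.put("embedded-exception", trace.toString());
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> m = parseAttachment(
                () -> { throw new RuntimeException("corrupt attachment"); }, true);
        System.out.println(m.containsKey("embedded-exception"));
    }
}
```

With the flag off, the exception propagates exactly as before, which is why the change to legacy behavior stays small.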



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004
 ] 

Moritz Dorka commented on TIKA-1315:


Well, the original patch by Filip is essentially an 80% solution. Everything 
that I added is rather obscure functionality...

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch


 Hello guys, I am really sorry to post an issue like this, because I have no 
 other way of contacting you and I don't quite understand how you manage forks 
 and pull requests (I don't think you do that). Plus I don't know your coding 
 styles and stuff.
 In my project I needed Tika to parse numbered lists from Word .doc 
 documents, but Tika doesn't support it. So I looked for a solution and found 
 one here: 
 http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
 So I adapted this solution to Apache Tika with a few fixes and improvements. 
 Anyway, feel free to use any of it so it can help people who struggle with 
 lists in Tika like I did.
 Attached files are:
 Updated test
 Fixed WordExtractor
 Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505006#comment-14505006
 ] 

Tim Allison commented on TIKA-1513:
---

Y, I was concerned by that generally.  Are you getting false positives with 
0x03 specifically?  I didn't find any in govdocs1, but I realize that corpus 
has limitations.

Will add text/plain as supertype.  Thank you!

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504951#comment-14504951
 ] 

Tim Allison commented on TIKA-1513:
---

From govdocs1, it looks like a first byte of 0x03 is a safe way to identify 
these files.  

[This|http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml] was 
useful.

Two mime type questions:
1)  What should we use as the canonical mime type for .dbf files?  Proposal: 
{{application/x-dbf}}.

2)  What mimes should the parser accept, or what should we include in the 
aliases?
From [filext.com|http://filext.com/file-extension/DBF]:
* application/dbase
* application/x-dbase
* application/dbf
* application/x-dbf
* zz-application/zz-winassoc-dbf

First attempt at mime definition:
{noformat}
  <mime-type type="application/x-dbf">
    <magic priority="100">
      <match value="0x03" type="string" offset="0"/>
    </magic>
    <glob pattern="*.dbf"/>
    <glob pattern="*.dbase"/>
  </mime-type>
{noformat}
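The one-byte magic under discussion amounts to a trivial check (illustrative sketch, not Tika code; names are made up): the first byte of a dBASE III .dbf file is the version marker 0x03. As the thread notes, a single-byte magic is prone to false positives, so a real detector would combine it with the *.dbf glob.

```java
// Illustrative sketch (not Tika code) of the proposed one-byte dbf magic:
// the dBASE III version marker 0x03 as the first byte of the file.
public class DbfMagic {
    public static boolean looksLikeDbf(byte[] prefix) {
        return prefix != null && prefix.length > 0 && prefix[0] == 0x03;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDbf(new byte[]{0x03, 0x62, 0x08, 0x1F}));  // dBASE III marker
        System.out.println(looksLikeDbf("plain text".getBytes()));             // 'p' != 0x03
    }
}
```

Any binary file that happens to start with 0x03 would also match, which is exactly the false-positive concern raised in the comments above.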

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] Apache Tika 1.8 Released

2015-04-21 Thread Mattmann, Chris A (3980)
Yay thanks Tyler!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-----Original Message-----
From: "Allison, Timothy B." <talli...@mitre.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Tuesday, April 21, 2015 at 8:34 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: RE: [ANNOUNCE] Apache Tika 1.8 Released

Thank you, Tyler!

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@apache.org]
Sent: Monday, April 20, 2015 5:09 PM
To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org
Subject: [ANNOUNCE] Apache Tika 1.8 Released

The Apache Tika project is pleased to announce the release of Apache Tika
1.8. The release
contents have been pushed out to the main Apache release site and to the
Maven Central sync, so the releases should be available as soon as the
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content
from various documents using existing parser libraries.

Apache Tika 1.8 contains a number of improvements and bug fixes. Details
can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.8.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip

Apache Tika is also available in binary form, or for use with Maven 2, from
the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the
downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Tyler Palsulich, on behalf of the Apache Tika community



[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042
 ] 

Moritz Dorka commented on TIKA-1315:


I believe I could speed up the process by ultimately writing a unit test for 
the POI-part... I'm just having a hard time motivating myself to write unit 
tests for a few stupid getters.

What you could also do is to hardcode 
{code}getLevelNumberingPlaceholderOffsets(){code} to always return 
{code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most 
(trivial) cases (however, I have not tested how my code reacts to such 
cheating).

There is also a very subtle bug left in my code which only triggers in 
ListLevelOverrides and _sometimes_ provokes wrong number increments. If I find 
the time I will update my patch.

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.9

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch


 Hello guys, I am really sorry to post an issue like this, because I have no 
 other way of contacting you and I don't quite understand how you manage forks 
 and pull requests (I don't think you do that). Plus I don't know your coding 
 styles and stuff.
 In my project I needed Tika to parse numbered lists from Word .doc 
 documents, but Tika doesn't support it. So I looked for a solution and found 
 one here: 
 http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
 So I adapted this solution to Apache Tika with a few fixes and improvements. 
 Anyway, feel free to use any of it so it can help people who struggle with 
 lists in Tika like I did.
 Attached files are:
 Updated test
 Fixed WordExtractor
 Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504871#comment-14504871
 ] 

Tim Allison commented on TIKA-1608:
---

[~jeremybmerrill], thank you for raising this issue. If you go to "More", 
there's an "Attach Files" option.  As I'm sure you've done, please only attach 
files that are ok to share with the public, and please let us know if the file 
is granted to Apache under ASF 2.0 so that we can use it in unit tests in the 
future.

I'll take a look at our govdocs1/CommonCrawl exceptions and see if I can find a 
doc in there that matches your stack trace.

From the stacktrace, it looks like the fix will have to be made at the POI 
level.  I could be wrong, though!  If you haven't done so already, please open 
a ticket on POI's 
[bugzilla|https://bz.apache.org/bugzilla/buglist.cgi?quicksearch=poi&list_id=123825]
  and add a hyperlink from there to here and vice versa so that we can track 
progress over here.

Thank you, again.

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill

 Extracting text from the Word 97-2004 document located here 
 (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
 with the following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
 1534-attachment.doc 
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@69af0db6
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(Native Method)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
   at 
 org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
   at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   ... 5 more
 I'm using trunk from Github, which I think is a flavor of 1.9. The document 
 opens properly in Word for Mac '11.
 Happy to answer questions; I'm also on the user mailing list. If it's 
 relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
 that document here in Jira rather than on my own dropbox.)





RE: [ANNOUNCE] Apache Tika 1.8 Released

2015-04-21 Thread Allison, Timothy B.
Thank you, Tyler!

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@apache.org] 
Sent: Monday, April 20, 2015 5:09 PM
To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org
Subject: [ANNOUNCE] Apache Tika 1.8 Released

The Apache Tika project is pleased to announce the release of Apache Tika
1.8. The release
contents have been pushed out to the main Apache release site and to the
Maven Central sync, so the releases should be available as soon as the
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content
from various documents using existing parser libraries.

Apache Tika 1.8 contains a number of improvements and bug fixes. Details
can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.8.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from
the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the
downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Tyler Palsulich, on behalf of the Apache Tika community


[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504884#comment-14504884
 ] 

Tim Allison commented on TIKA-1295:
---

[~lewismc], +1 to adding potential for hierarchical metadata on TIKA-1607.  We 
should ensure during the transition (and maybe forever), that users can still 
get strings fairly easily.

 Make some Dublin Core items multi-valued
 

 Key: TIKA-1295
 URL: https://issues.apache.org/jira/browse/TIKA-1295
 Project: Tika
  Issue Type: Bug
  Components: metadata
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
 dc:title, dc:description and dc:rights should allow multiple values because 
 of language alternatives.  Unless anyone objects in the next few days, I'll 
 switch those to Property.toInternalTextBag() from Property.toInternalText().  
 I'll also modify PDFParser to extract dc:rights.





[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505113#comment-14505113
 ] 

Tim Allison commented on TIKA-1608:
---

In govdocs1, there are 24 of these:
{noformat}
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.poi.hwpf.sprm.SprmBuffer.append(SprmBuffer.java:128)
at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:293)
at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:116)
at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:136)
at o.a.t.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
{noformat}

There are 2 of those in our commoncrawl slice.

Nothing that matches your trace, though.  
Thank you for attaching it.  How common is this stack trace in your set?

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc


 Extracting text from the Word 97-2004 document attached here fails with the 
 following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
 1534-attachment.doc 
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@69af0db6
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(Native Method)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
   at 
 org.apache.poi.hwpf.model.PAPFormattedDiskPage.init(PAPFormattedDiskPage.java:101)
   at 
 org.apache.poi.hwpf.model.OldPAPBinTable.init(OldPAPBinTable.java:49)
   at org.apache.poi.hwpf.HWPFOldDocument.init(HWPFOldDocument.java:109)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   ... 5 more
 I'm using trunk from Github, which I think is a flavor of 1.9. The document 
 opens properly in Word for Mac '11.
 Happy to answer questions; I'm also on the user mailing list. If it's 
 relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
 that document here in Jira rather than on my own dropbox.)





[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505132#comment-14505132
 ] 

Luis Filipe Nassif commented on TIKA-879:
-

Maybe we could keep the original magics and ADD the widened versions with a 
\n prefix to decrease the number of false positives (I have got a small 
number of them)? Could you try the widened magics with govdocs1 
[~talli...@mitre.org]?

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}} mime type for {{.eml}} files is different (you 
 can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really works 
 for such files, even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}} 
 file name in {{RESOURCE_NAME_KEY}}.
 As I found {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.





[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505172#comment-14505172
 ] 

Luis Filipe Nassif commented on TIKA-1554:
--

Actually r1667661

 Improve EMF file detection
 --

 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: nonEmf.dat


 I am getting many files being incorrectly detected as application/x-emf. I 
 think the current magic is too common. According to MS documentation 
 (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
 https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
 to:
 {code}
 <mime-type type="application/x-emf">
   <acronym>EMF</acronym>
   <_comment>Extended Metafile</_comment>
   <glob pattern="*.emf"/>
   <magic priority="50">
     <match value="0x0100" type="string" offset="0">
       <match value=" EMF" type="string" offset="40"/>
     </match>
   </magic>
 </mime-type>
 {code}





[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505178#comment-14505178
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

It's the only one I've found so far out of 300,000ish documents (most of which 
are plain emails, few of which are .docs).

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc







Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Tyler Palsulich
Hi Lewis,

I also tried upgrading Tika in Nutch, but ran into the same issue
(though udunits
is found, as expected):

[ivy:retrieve] ::
[ivy:retrieve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:retrieve] ::
[ivy:retrieve] :: edu.ucar#jj2000;5.2: not found
[ivy:retrieve] :: org.itadaki#bzip2;0.9.1: not found
[ivy:retrieve] ::

Thanks for pushing the dependencies out.

Tyler

On Tue, Apr 21, 2015 at 1:50 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Whilst addressing NUTCH-1994, I've experienced a dependency problem
 (related to unpublished artifacts on Maven Central) which I am working
 through right now.
 When making the upgrade in Nutch, I get the following:

 [ivy:resolve]   -- artifact edu.ucar#udunits;4.5.5!udunits.jar:
 [ivy:resolve]

 http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar
 [ivy:resolve] ::
 [ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
 [ivy:resolve] ::
 [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
 [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
 [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
 [ivy:resolve] ::
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 BUILD FAILED
 /usr/local/trunk_clean/build.xml:112: The following error occurred while
 executing this line:
 /usr/local/trunk_clean/src/plugin/build.xml:60: The following error
 occurred while executing this line:
 /usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to
 resolve dependencies:
 resolve failed - see output for details

 Total time: 17 seconds

 I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts, so they
 will be available imminently. The remaining artifact, edu.ucar#jj2000;5.2,
 has a corrupted POM, which means that OSS Nexus will not accept it. I'll
 send a pull request further upstream for that ASAP.

 Finally, the BZIP dependency is a 3rd party dependency from another Org,
 Licensed under MIT license. So I will register interest to publish this
 dependency, push it, then we will be good to go.

 Lewis



 --
 *Lewis*



[GitHub] tika pull request: add entry for cbor glob extension in the tika-m...

2015-04-21 Thread LukeLiush
GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/42

add entry for cbor glob extension in the tika-mimetypes.xml



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika cborExtension

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #42


commit 5b86cccdfc6d637cb44c9f8b2642e438c2ae5ff4
Author: LukeLiush hanson311...@gmail.com
Date:   2015-04-21T21:39:07Z

add entry for cbor glob extension in the tika-mimetypes.xml




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1601) Integrate Jackcess to handle MSAccess files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505633#comment-14505633
 ] 

Tim Allison commented on TIKA-1601:
---

I don't. That's half the fun of a patch, right? :) On the sqlite parser, I 
tried to have at least one column for each data type, non-ASCII text to 
confirm no encoding problems, and an embedded doc.  

Happy to generate this if it would help. Thank you, again.

 Integrate Jackcess to handle MSAccess files
 ---

 Key: TIKA-1601
 URL: https://issues.apache.org/jira/browse/TIKA-1601
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison

 Recently, James Ahlborn, the current maintainer of 
 [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense 
 Jackcess to Apache 2.0.  [~boneill], the CTO at [Health Market Science, a 
 LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with 
 this relicensing and led the charge to obtain all necessary corporate 
 approval to deliver a 
 [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to 
 Apache.  As anyone who has tried to get corporate approval for anything 
 knows, this can sometimes require not a small bit of effort.
 If I may speak on behalf of Tika and the larger Apache community, I offer a 
 sincere thanks to James, Brian and the other developers and contributors to 
 Jackcess!!!
 Once the licensing info has been changed in Jackcess and the new release is 
 available in maven, we can integrate Jackcess into Tika and add a capability 
 to process MSAccess.
 As a side note, I reached out to the developers and contributors to determine 
 if there were any objections.  I couldn't find addresses for everyone, and 
 not everyone replied, but those who did offered their support to this move. 





[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506214#comment-14506214
 ] 

Tim Allison commented on TIKA-1513:
---

In looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if 
we could add 0x00 at 30 and 31?

I'm currently grepping the Common Crawl slice from Julien Nioche for files 
starting with 0x03; the vast majority are .dbf, but there are some 
that end in .dct, .ndx (dbf index?), .tfm, .ctg...  Will report findings 
tomorrow.
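In tika-mimetypes.xml syntax, that suggestion could be sketched roughly as below. The mime type name, the priority, and the nested-match structure are assumptions for illustration, and the entry is untested:

{code}
<mime-type type="application/x-dbf">
  <glob pattern="*.dbf"/>
  <magic priority="50">
    <!-- 0x03 version byte at offset 0... -->
    <match value="0x03" type="string" offset="0">
      <!-- ...plus the proposed 0x00 bytes at offsets 30 and 31 -->
      <match value="0x0000" type="string" offset="30"/>
    </match>
  </magic>
</mime-type>
{code}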

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?





[jira] [Assigned] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1610:
---

Assignee: Chris A. Mattmann

 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
 rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika were able to provide support for CBOR parsing and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in 
 the CBOR format. In order to read/parse those dumped files, it would be 
 great if Tika were able to support parsing CBOR. The thing is that 
 CommonCrawlDataDumper does not dump with the correct extension; it 
 dumps with its own rule, and the default extension of the dumped file is 
 html, so it would be less painful if Tika were able to detect and parse those 
 files without any pre-processing steps. 
 CommonCrawlDataDumper calls the following to dump with CBOR:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd-party library for converting JSON to CBOR and vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have its magic numbers to be detected/identified by other 
 applications (PFA: rfc_cbor.jpg)
 It seems that the only way to inform other applications of the type as of now 
 is using the extension (i.e. .cbor), or probably content detection (i.e. byte 
 histogram distribution estimation).  
 There is another thing worth attention: it looks like Tika has attempted 
 to add support for CBOR mime detection in tika-mimetypes.xml 
 (PFA: cbor_tika.mimetypes.xml.jpg); this detection is not working with the 
 cbor file dumped by CommonCrawlDataDumper. 
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag 55799 that seems usable for CBOR type 
 identification (the hex encoding is 0xd9d9f7), but it is probably up to the 
 application to take care of this tag, and it is also possible that the 
 fasterxml library used by the Nutch dumping tool omits this tag. An example 
 cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has also been 
 attached (PFA: 142440269.html).
 The following info is cited from the RFC: "...a decoder might be able to 
 parse both CBOR and JSON.
Such a decoder would need to mechanically distinguish the two
formats.  An easy way for an encoder to help the decoder would be to
tag the entire CBOR item with tag 55799, the serialization of which
will never be found at the beginning of a JSON text..."
 It looks like a file can have two parts/sections, i.e. a plain-text 
 part and the JSON serialized as CBOR; this might also be worth attention 
 and consideration for parsing and type identification.
 On the other hand, it is worth noting that an entry for cbor extension 
 detection needs to be appended to tika-mimetypes.xml too, 
 e.g.
 <glob pattern="*.cbor"/>





[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506359#comment-14506359
 ] 

Chris A. Mattmann commented on TIKA-1610:
-

Applied Pull request #42 thanks [~Lukeliush]!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "WIP Fix for TIKA-1610: Support 
MIME extension for CBOR files contributed by LukeLiush <hanson311...@gmail.com> 
this closes #42" CHANGES.txt 
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        CHANGES.txt
Sending
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Transmitting file data ..
Committed revision 1675250.
[chipotle:~/tmp/tika] mattmann% 
{noformat}

Will look for improvements and the parser next, so will leave this open!


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
 rfc_cbor.jpg







[GitHub] tika pull request: add entry for cbor glob extension in the tika-m...

2015-04-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/42




Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Mattmann, Chris A (3980)
Thanks Lewis!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, April 21, 2015 at 7:14 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: NUTCH-1994 and UCAR Dependencies

Hi Folks,
OK, so the final part of this jigsaw is as follows

I've requested a staging area [0] on Sonatype OSSRH to release the MIT
licensed 3rd party bzip2 artifacts.
I had to Mavenize the project. I will submit this patch to the bzip2
project and hopefully they will pull it in. If not then I will fork the
project and maintain it myself.

[0] https://issues.sonatype.org/browse/OSSRH-15143
[1] https://code.google.com/p/jbzip2/

On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Update

 On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:



 [ivy:resolve] ::
 [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
 [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
 [ivy:resolve] ::



 Both of the above are now on Maven Central.
 I had to fix a couple of issues in the jj2000 library, namely
 https://github.com/Unidata/jj2000/pull/3 which was blocking us.

 I'm moving on to deal with the final one

 [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found

 I'll update in due course.
 Thanks
 Lewis




-- 
*Lewis*



Re: [memex-jpl] this week action from luke

2015-04-21 Thread Chris Mattmann
Thanks Luke.

So I guess all I was asking was could you try it out. Thanks for the
lesson in the RFC.

Cheers,
Chris


Chris Mattmann
chris.mattm...@gmail.com




-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov, Chris Mattmann
chris.mattm...@gmail.com, 'Totaro, Giuseppe U (3980-Affiliate)'
tot...@di.uniroma1.it, dev@tika.apache.org
Cc: 'Bryant, Ann C (398G-Affiliate)' anniebry...@gmail.com, 'Zimdars,
Paul A (3980-Affiliate)' paul.a.zimd...@jpl.nasa.gov, NSF Polar
CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com,
memex-...@googlegroups.com
Subject: RE: [memex-jpl] this week action from luke

Hi professor,


I think it highly depends on the content being read by Tika. E.g., if
there is a sequence of bytes in the file being read that matches one or
more of the mime types defined in our tika-mimes.xml, I guess that Tika
will put those types in its estimation list; please note there could be
multiple estimated mime types from the magic-byte detection approach. Now
Tika also considers the decision made by the extension detection approach:
if the extension says the file type it believes is the first one in the
magic type estimation list, then certainly the first one will be returned
(the same applies to the metadata hint approach).
Of course, Tika also prefers the type that is the most specialized.

Let's get back to the following question; here is my guess, though.
[Prof]: Also what happens if you tweak the definition of XHTML to not
scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
Consider an extreme case where we only scan 10 bytes, or even 1: the magic
bytes will inevitably detect nothing, and I think Tika will return
something like application/octet-stream, the most general type. As
mentioned, Tika favours the most specialized type, so if the extension
approach returns a more specialized one, it wins; since almost every type
is a subclass of application/octet-stream, the answer in this extreme case
may be yes. I think it is very possible that the CBOR type detected by the
extension approach takes over here...

My idea was, and still is, that if the CBOR self-describing tag 55799 is
present in the CBOR file, then it can be used to detect the CBOR type.
Again, the CBOR type will probably be appended to the magic candidate list
together with another type such as application/html; I guess the order in
the list also matters, with the first entry preferred over the next. The
decision from the extension detection approach also plays a role in
breaking the tie: e.g. if the extension method agrees with one of the
candidate types in the magic list (say CBOR), then CBOR will be returned
(again, the same applies to the metadata-hint method).

I have not yet taken a closer look at a CBOR file that carries tag 55799,
but I expect its hex encoding to be something like 0xd9d9f7, i.e. the tag
should appear in the header as a fixed sequence of bytes
(https://tools.ietf.org/html/rfc7049#section-2.4.5). If that sequence is
present in the file, preferably in the header (within a reasonable range
of bytes), I believe it can be used as the magic number for the CBOR type.
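That check can be sketched in a few lines of plain Java (a sketch only: the class and method names are hypothetical, not Tika API; it assumes the tag is serialized as exactly the three bytes 0xD9 0xD9 0xF7 at the start of the stream, per RFC 7049 section 2.4.5):

```java
import java.util.Arrays;

// Sketch of magic-byte detection for CBOR via the self-describe tag 55799.
// Hypothetical helper, not part of the Tika API.
public class CborSniffer {
    // RFC 7049 section 2.4.5: tag 55799 serializes as 0xD9 0xD9 0xF7
    // (major type 6 with a 16-bit argument, 0xD9F7 = 55799).
    private static final byte[] SELF_DESCRIBE = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    // True if the stream header starts with the CBOR self-describe tag.
    public static boolean looksLikeCbor(byte[] header) {
        if (header == null || header.length < SELF_DESCRIBE.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, SELF_DESCRIBE.length), SELF_DESCRIBE);
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA0}; // tag + empty map
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(looksLikeCbor(tagged)); // prints "true"
        System.out.println(looksLikeCbor(json));   // prints "false"
    }
}
```

Note that this only works for encoders that actually write the tag; as discussed below, the FasterXML-based Nutch dumper may not.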


There is another thing I mentioned in the JIRA ticket I opened yesterday
about the CBOR parser and detection: it is also possible for CBOR content
to be embedded inside a plain JSON file, and the way a decoder can
distinguish the two in that file is, again, by looking for tag 55799. This
may rarely happen, but a robust parser might be able to take care of it;
Tika might need to consider the FasterXML (Jackson) CBOR module used by
the Nutch tool when developing the CBOR parser...
Again, let me cite the same paragraph from the RFC:

 a decoder might be able to parse both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text.


Thanks
Luke



-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate);
'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there?
Also, what happens if you tweak the definition of XHTML to not scan until
8192, but say 6000 (e.g., 0:6000)? Does CBOR take over then?
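The tweak being discussed here would be an edit to the html/xhtml magic in tika-mimetypes.xml, narrowing the offset range the detector scans. A purely illustrative sketch (the match values and ranges below are assumptions, not the actual entries):

```xml
<!-- Before (assumed): the root element may match anywhere in the first 8192 bytes -->
<match value="&lt;html" type="string" offset="0:8192"/>
<!-- Tweaked: stop scanning at 6000, so a CBOR candidate can win for late-matching files -->
<match value="&lt;html" type="string" offset="0:6000"/>
```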

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data 

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14506414#comment-14506414
 ] 

Hudson commented on TIKA-1610:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #640 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/640/])
WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by 
LukeLiush hanson311...@gmail.com this closes #42 (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1675250)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
 rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely
 small code size, fairly small message size, and extensibility without the
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika could provide CBOR parsing and identification. In
 the current project with Nutch, the Nutch CommonCrawlDataDumper is used to
 dump crawled segments to files in the CBOR format. In order to read/parse
 those dumped files, it would be great if Tika supported parsing CBOR. The
 thing is that CommonCrawlDataDumper does not dump with the correct
 extension; it follows its own rule, and the default extension of a dumped
 file is .html, so it would be less painful if Tika could detect and parse
 those files without any pre-processing steps.
 CommonCrawlDataDumper calls the following to dump CBOR:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 FasterXML is a third-party library for converting JSON to CBOR and vice
 versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like
 CBOR does not yet have magic numbers by which other applications can
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that the only ways to inform other applications of the type, as of
 now, are the file extension (i.e. .cbor) or content-based detection (e.g.
 byte histogram distribution estimation).
 There is another thing worth attention: it looks like Tika has already
 attempted to add CBOR mime detection to tika-mimetypes.xml
 (PFA: cbor_tika.mimetypes.xml.jpg), but this detection does not work with
 the CBOR files dumped by CommonCrawlDataDumper.
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a
 self-describing tag 55799 that seems usable for CBOR type identification
 (its hex encoding might be 0xd9d9f7), but it is probably up to the
 application to take care of this tag, and it is also possible that the
 FasterXML library used by the Nutch dumping tool omits it. An example CBOR
 file dumped by the Nutch tool (CommonCrawlDataDumper) has also been
 attached (PFA: 142440269.html).
 The following is cited from the RFC: ...a decoder might be able to
 parse both CBOR and JSON.
    Such a decoder would need to mechanically distinguish the two
    formats.  An easy way for an encoder to help the decoder would be to
    tag the entire CBOR item with tag 55799, the serialization of which
    will never be found at the beginning of a JSON text...
 It looks like a file can have two parts/sections, i.e. a plain-text part
 and JSON serialized as CBOR; this might also be worth attention and
 consideration for parsing and type identification.
 On the other hand, it is worth noting that an entry for CBOR extension
 detection needs to be appended to tika-mimetypes.xml too, e.g.
 <glob pattern="*.cbor"/>
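As a sketch, such an entry in tika-mimetypes.xml might look like the following (the type name application/cbor, the priority, and the hex-valued magic match are illustrative assumptions; the exact syntax accepted by Tika's MimeTypesReader should be checked against existing entries):

```xml
<mime-type type="application/cbor">
  <!-- Assumed magic: the self-describe tag 55799, serialized as 0xD9 0xD9 0xF7 -->
  <magic priority="60">
    <match value="0xd9d9f7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```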



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [memex-jpl] this week action from luke

2015-04-21 Thread Luke
Hi professor,


I think it depends heavily on the content being read by Tika: if the file
being read contains a byte sequence matching one or more of the magic
patterns defined in our tika-mimetypes.xml, I guess Tika will put those types
into its candidate list; note that the magic-byte detection approach can
produce multiple candidate mime types. Tika then also considers the decision
made by the extension detection approach: if the extension detector's type
matches the first one in the magic candidate list, that first one will
certainly be returned (the same applies to the metadata-hint approach).
Of course, Tika also prefers the type that is the most specialized.

Let's get back to the following question; here is my guess, though.
[Prof]: Also what happens if you tweak the definition of XHTML to not scan
until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
Consider an extreme case where we only scan 10 bytes, or even 1: the magic
bytes will inevitably detect nothing, and I think Tika will return something
like application/octet-stream, the most general type. As mentioned, Tika
favours the most specialized type, so if the extension approach returns a
more specialized one, it wins; since almost every type is a subclass of
application/octet-stream, the answer in this extreme case may be yes. I think
it is very possible that the CBOR type detected by the extension approach
takes over here...

My idea was, and still is, that if the CBOR self-describing tag 55799 is
present in the CBOR file, then it can be used to detect the CBOR type.
Again, the CBOR type will probably be appended to the magic candidate list
together with another type such as application/html; I guess the order in the
list also matters, with the first entry preferred over the next. The decision
from the extension detection approach also plays a role in breaking the tie:
e.g. if the extension method agrees with one of the candidate types in the
magic list (say CBOR), then CBOR will be returned (again, the same applies to
the metadata-hint method).

I have not yet taken a closer look at a CBOR file that carries tag 55799, but
I expect its hex encoding to be something like 0xd9d9f7, i.e. the tag should
appear in the header as a fixed sequence of bytes
(https://tools.ietf.org/html/rfc7049#section-2.4.5). If that sequence is
present in the file, preferably in the header (within a reasonable range of
bytes), I believe it can be used as the magic number for the CBOR type.


There is another thing I mentioned in the JIRA ticket I opened yesterday
about the CBOR parser and detection: it is also possible for CBOR content to
be embedded inside a plain JSON file, and the way a decoder can distinguish
the two in that file is, again, by looking for tag 55799. This may rarely
happen, but a robust parser might be able to take care of it; Tika might need
to consider the FasterXML (Jackson) CBOR module used by the Nutch tool when
developing the CBOR parser...
Again, let me cite the same paragraph from the RFC:

 a decoder might be able to parse both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text.


Thanks
Luke



-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF 
Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there? Also,
what happens if you tweak the definition of XHTML to not scan until 8192, but
say 6000 (e.g., 0:6000)? Does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 12:19 AM
To: Chris Mattmann chris.mattm...@gmail.com, Totaro, Giuseppe U 
(3980-Affiliate) tot...@di.uniroma1.it, Chris Mattmann 
chris.a.mattm...@jpl.nasa.gov
Cc: Bryant, Ann 

Re: Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Nick Burch

On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
For the first step, I listed the file formats that are widely used in
climate science.


FORTRAN (.f, .f90, f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System)
(.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
when I used Tika to obtain the content type of files with the suffixes .f,
.f90, and .m, Tika detected these files as text/plain


Your first step, then, is probably to try to work out how to identify these
files and add suitable mime magic for them, if possible. At the same time,
make sure the common file extensions for them are listed against their mime
entries, and that we have mime entries for all of these formats.
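A sketch of what such a mime entry might look like (the type name text/x-fortran and the glob list are illustrative assumptions; check the existing tika-mimetypes.xml entries before adding or changing anything):

```xml
<mime-type type="text/x-fortran">
  <!-- Illustrative globs for common Fortran suffixes -->
  <glob pattern="*.f"/>
  <glob pattern="*.f77"/>
  <glob pattern="*.f90"/>
  <sub-class-of type="text/plain"/>
</mime-type>
```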


I'd probably recommend creating one JIRA per format with detection issues, 
then use that to track the work to add/expand the mime type, attach a 
small sample file, add detection unit tests etc.


Should I build a parser for each file format to get an exact 
content-type, as Java has SourceCodeParser?


As Lewis has said, once detection is working, you'll then want to add the 
missing parsers. You might find that the current SourceCodeParser could, 
with a little bit of work, handle some of these formats itself. Additional 
libraries+parsers may well be needed for the others. I'd suggest one JIRA 
per format that needs a parser we lack, then use those to track the 
work.


Good luck!

Nick


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Hi Folks,
OK, so the final part of this jigsaw is as follows

I've requested a staging area [0] on Sonatype OSSRH to release the MIT
licensed 3rd party bzip2 artifacts.
I had to Mavenize the project. I will submit this patch to the bzip2
project and hopefully they will pull it in. If not then I will fork the
project and maintain it myself.

[0] https://issues.sonatype.org/browse/OSSRH-15143
[1] https://code.google.com/p/jbzip2/

On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Update

 On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:



 [ivy:resolve] ::
 [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
 [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
 [ivy:resolve] ::



 Both of the above are now on Maven Central.
 I had to fix a couple of issues in the jj2000 library, namely
 https://github.com/Unidata/jj2000/pull/3 which was blocking us.

 I'm moving on to deal with the final one

 [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found

 I'll update in due course.
 Thanks
 Lewis




-- 
*Lewis*


Re: Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Lewis John Mcgibbney
Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM, dev-digest-h...@tika.apache.org wrote:


 FORTRAN (.f, .f90, f77)
 Python (.py)
 R (.R)
 Matlab (.m)
 GrADS (Grid Analysis and Display System)
 (.gs)
 NCL (NCAR Command Language) (.ncl)
 IDL (Interactive Data Language) (.pro)


NICE list



 I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
 when I used Tika to obtain the content type of files with the suffixes .f,
 .f90, and .m, Tika detected these files as text/plain:

 ohjihyun% tika -m spctime.f

 Content-Encoding: ISO-8859-1
 Content-Length: 16613
 Content-Type: text/plain; charset=ISO-8859-1
 X-Parsed-By: org.apache.tika.parser.DefaultParser
 X-Parsed-By: org.apache.tika.parser.txt.TXTParser
 resourceName: spctime.f


[SNIP]


 Should I build a parser for each file format to get an exact content-type,
 as Java has SourceCodeParser?


As far as I know we have no parser for Fortran documents.
You could try using the following Java project
http://sourceforge.net/projects/fortran-parser/
It is dual licensed under Eclipse and BSD licenses.
Hope this helps.
Lewis


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Hi Folks,
Update

On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:



 [ivy:resolve] ::
 [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
 [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
 [ivy:resolve] ::



Both of the above are now on Maven Central.
I had to fix a couple of issues in the jj2000 library, namely
https://github.com/Unidata/jj2000/pull/3 which was blocking us.

I'm moving on to deal with the final one

[ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found

I'll update in due course.
Thanks
Lewis


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Patch for Mavenizing the bzip2 project
https://code.google.com/p/jbzip2/issues/detail?id=3
Lewis

On Tue, Apr 21, 2015 at 4:14 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 OK, so the final part of this jigsaw is as follows

 I've requested a staging area [0] on Sonatype OSSRH to release the MIT
 licensed 3rd party bzip2 artifacts.
 I had to Mavenize the project. I will submit this patch to the bzip2
 project and hopefully they will pull it in. If not then I will fork the
 project and maintain it myself.

 [0] https://issues.sonatype.org/browse/OSSRH-15143
 [1] https://code.google.com/p/jbzip2/

 On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Update

 On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:



 [ivy:resolve] ::
 [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
 [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
 [ivy:resolve] ::



 Both of the above are now on Maven Central.
 I had to fix a couple of issues in the jj2000 library, namely
 https://github.com/Unidata/jj2000/pull/3 which was blocking us.

 I'm moving on to deal with the final one

 [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found

 I'll update in due course.
 Thanks
 Lewis




 --
 *Lewis*




-- 
*Lewis*


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505269#comment-14505269
 ] 

Tim Allison commented on TIKA-879:
--

Y, will do. Results probably tomorrow.

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
 differs (you can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really 
 works for such files, even if you set {{CONTENT_TYPE}} in the metadata or 
 an {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
 As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505269#comment-14505269
 ] 

Tim Allison edited comment on TIKA-879 at 4/21/15 5:04 PM:
---

Y, will do. Results probably tomorrow.

This?
<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Relay-Version:" type="string" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="string" offset="0"/>
    <match value="\nReturn-Path:" type="string" offset="0:1000"/>
    <match value="From:" type="string" offset="0"/>
    <match value="Received:" type="string" offset="0"/>
    <match value="\nReceived:" type="string" offset="0:1000"/>
    <match value="Message-ID:" type="string" offset="0"/>
    <match value="\nMessage-ID:" type="string" offset="0:1000"/>
    <match value="Date:" type="string" offset="0"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <glob pattern="*.mht"/>
  <glob pattern="*.mhtml"/>
  <sub-class-of type="text/plain"/>
</mime-type>



was (Author: talli...@mitre.org):
Y, will do. Results probably tomorrow.

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
 differs (you can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really 
 works for such files, even if you set {{CONTENT_TYPE}} in the metadata or 
 an {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
 As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)