[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310899#comment-14310899
 ] 

Luis Filipe Nassif commented on TIKA-1541:
------------------------------------------

Hi Chris, I definitely agree Giuseppe's patch is a great start!

But see that in TIKA-1483 I said I have a specific implementation for 
extracting Latin1 scripts coded with ISO8859-1, UTF8 and UTF16 charsets at the 
same time (less general than proposed in the issue) and asked if it would be of 
interest. If the community still thinks it would be useful, I will submit a 
patch.

A possible improvement to Giuseppe's patch is to let the user configure the 
encoding parameter of unix strings, it is not hard to write and is a powerful 
configuration.

I agree to not enable it by default for octet-stream, as I also suggested to 
not enable TesseractOCRParser by default in the past, they can add a lot of 
time to parsing and surprise users as Nick pointed.

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
>                 Key: TIKA-1541
>                 URL: https://issues.apache.org/jira/browse/TIKA-1541
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.patch
>
>
> I thought to implement an extremely simple implementation of 
> {{StringsParser}}, a parser based on the {{strings}} command (or 
> {{strings}}-alternative command), instead of using the dummy {{EmptyParser}} 
> for undetected files. It is a preliminary work (you can see a lot of todos). 
> It is inspired by the work on {{TesseractOCRParser}}. You can find the patch 
> in attachment.
> I created a GitHub 
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
> code. As first test, you can clone the repo, build the code using the 
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from 
> "016" subset) detected as {{application/octet-stream}}. The latter script 
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting 
> ASCII strings from undetected filetypes. As far as I understood, many 
> "sophisticated" forensics tools work in a similar manner for indexing 
> purposes. They use a sort of {{strings}} command against files that they are 
> not able to detect.
> In addition to run {{strings}} on undetected files, the {{StringsParser}} 
> launches the {{file}} command on undetected files and then writes the output 
> in the {{strings:file_output}} property (I noticed that sometimes the 
> {{file}} command is able to detect the media type for documents not detected 
> by Tika).
> Finally, you can fine an old discussion about this topic 
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
> Thanks [~chrismattmann].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to