[jira] [Commented] (TIKA-1436) improvement to PDFParser

2015-02-07 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310638#comment-14310638
 ] 

Stefano Fornari commented on TIKA-1436:
---

Oops, I did not notice that this needed some background. As per the mentioned thread 
on the mailing list, which I am quoting below for your convenience, I believe 
there was consensus that the current pattern is not the best and is 
difficult to understand. However, I am not sure what you mean by the many 
unrelated changes to methods/variables. I had a quick look at the patch and 
could not find any. Can you please point them out?

Thanks in advance,

> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason of throwing the
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background).

Yes, the pattern is a bit awkward and generally shouldn't be
recommended as it uses an exception to control the flow of the
program. However, in this case we considered it worth doing as the
alternative would have been far more complicated.

Basically we wanted to avoid having to modify each parser
implementation (even those implemented outside Tika...) to keep track
of how much content has already been extracted and instead do that
just once in the WriteOutContentHandler class. However, the only way
for the WriteOutContentHandler to signal that parsing should be
stopped is by throwing a SAXException, which is what we're doing here.
By catching the exception and inspecting it with isWriteLimitReached()
the client can determine whether this is what happened.

BR,

Jukka Zitting
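
To make the pattern described above concrete, here is a minimal sketch in plain Java. It does not use Tika: LimitedTextHandler and WriteLimitReached are hypothetical stand-ins for WriteOutContentHandler and its write-limit signal, built only on the SAX classes that ship with the JDK. Where the real handler lets the client inspect the exception with isWriteLimitReached(), this sketch assumes a dedicated exception subclass instead. The handler counts characters and aborts by throwing a SAXException; the caller treats that exception as an expected outcome rather than an error.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for Tika's WriteOutContentHandler; not the real class. */
class LimitedTextHandler extends DefaultHandler {

    /** Marker exception so a caller can tell "limit hit" from a real parse error. */
    static class WriteLimitReached extends SAXException {
        WriteLimitReached(int limit) {
            super("write limit of " + limit + " characters reached");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int limit;

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int room = limit - text.length();
        if (length > room) {
            // Keep what still fits, then stop the parse the only way SAX allows.
            text.append(ch, start, room);
            throw new WriteLimitReached(limit);
        }
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    /** Feeds chunks the way a parser would; returns true if the limit stopped it. */
    static boolean feed(LimitedTextHandler handler, String... chunks) {
        try {
            for (String chunk : chunks) {
                handler.characters(chunk.toCharArray(), 0, chunk.length());
            }
            return false;
        } catch (WriteLimitReached e) {
            return true; // expected: the caller asked for a limited read
        } catch (SAXException e) {
            throw new IllegalStateException(e); // any other SAX failure is a real error
        }
    }
}
```

With a limit of 5, feeding "abc" and then "defg" stops the parse and leaves "abcde" in the handler. No parser has to know about the limit, which is the point made in the reply above.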

> improvement to PDFParser
> 
>
> Key: TIKA-1436
> URL: https://issues.apache.org/jira/browse/TIKA-1436
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Stefano Fornari
>  Labels: parser, pdf
> Attachments: ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters" 
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika 
> 1.6 there has been some work on better handling of the 
> WriteLimitReachedException condition, but I believe it could be improved 
> even further. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310709#comment-14310709
 ] 

Nick Burch commented on TIKA-1539:
--

Thanks for this. I've made a minor tweak in r1658056 - the indents weren't 
quite consistent, and the entry wanted to be further up so the file remains 
sorted.

> GRB file magic bytes and extension matching 
> 
>
> Key: TIKA-1539
> URL: https://issues.apache.org/jira/browse/TIKA-1539
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
>
> GRB type detection with magic bytes and extension matching probably needs to be 
> supported in Tika. The GRB parser is under development, so it may be good to 
> have magic-byte and extension-based detection for it.
> However, GRB does not have a standard MIME type, so the following extension and 
> magic matching settings in tika-mimetypes.xml are proposed for 
> GRB MIME type identification.
> 
> GRIB
> <_comment>General Regularly-distributed Information in Binary 
> form
> http://en.wikipedia.org/wiki/GRIB
> 
>   
> 
> 
> 
> 
> Any suggestions or advice are welcome and appreciated.
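
As an illustration of what a magic-plus-extension rule of this kind checks, here is a small Java sketch. It rests on two assumptions: that a GRIB file starts with the four ASCII bytes "GRIB" at offset 0, and that such files use a .grb extension. GribSniffer is a hypothetical helper written for this example; Tika itself expresses such rules declaratively in tika-mimetypes.xml rather than in code like this.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper; Tika does this declaratively in tika-mimetypes.xml. */
class GribSniffer {

    // Assumed magic: the ASCII bytes "GRIB" at offset 0.
    private static final byte[] MAGIC = "GRIB".getBytes(StandardCharsets.US_ASCII);

    /** True if the buffer starts with the four ASCII bytes "GRIB". */
    static boolean hasGribMagic(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if the file name ends in .grb (extension assumed for illustration). */
    static boolean hasGribExtension(String fileName) {
        return fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".grb");
    }
}
```

Checking the magic bytes first and falling back to the extension mirrors the usual detection order, since content is a stronger signal than a file name.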





[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310711#comment-14310711
 ] 

Nick Burch commented on TIKA-1541:
--

I'm not sure if we want to be activating this by default for 
application/octet-stream - it could surprise users of "unsupported" files, 
could add to the processing time and memory of those files, and means that we'd 
have the unexpected case that "known but unsupported mime type" would have less 
returned than "unknown mime type"!

Once we have the different parser strategy stuff in place (see TIKA-1509), I 
could very much see this being great as a default in the "give me all you can" 
situation (in place of the current EmptyParser)

Until we have that in place, I think we probably ought to not register it in 
the parsers list. Otherwise, people will suddenly find processing times go up, 
and those who've put time into getting mime types defined + detected for their 
parser-less formats will end up worse off than those who haven't

> StringsParser: a simple strings-based parser for Tika
> -
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
> Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.patch
>
>
> I have written an extremely simple implementation of 
> {{StringsParser}}, a parser based on the {{strings}} command (or a 
> {{strings}}-alternative command), to use instead of the dummy {{EmptyParser}} 
> for undetected files. It is preliminary work (you can see a lot of todos). 
> It is inspired by the work on {{TesseractOCRParser}}. You can find the patch 
> attached.
> I created a GitHub 
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
> code. As a first test, you can clone the repo, build the code using the 
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from 
> the "016" subset) detected as {{application/octet-stream}}. The latter script 
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting 
> ASCII strings from undetected filetypes. As far as I understand, many 
> "sophisticated" forensics tools work in a similar manner for indexing 
> purposes. They run a sort of {{strings}} command against files that they are 
> not able to detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}} 
> launches the {{file}} command on undetected files and then writes the output 
> to the {{strings:file_output}} property (I noticed that sometimes the 
> {{file}} command is able to detect the media type for documents not detected 
> by Tika).
> Finally, you can find an old discussion about this topic 
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
> Thanks [~chrismattmann].
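
For readers unfamiliar with what the strings command produces, the following pure-Java sketch approximates its core behaviour: it collects runs of printable ASCII bytes that are at least a minimum length long. This is not the approach taken in the patch, which shells out to the external command; AsciiStrings is a hypothetical class, and the printable range used here (0x20 to 0x7e) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pure-Java approximation of what the strings command extracts. */
class AsciiStrings {

    /** Returns every run of printable ASCII bytes that is at least minLength long. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7e) {
                run.append((char) b); // printable: extend the current run
            } else {
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0); // non-printable byte ends the run
            }
        }
        if (run.length() >= minLength) {
            found.add(run.toString()); // flush a run that reaches end of input
        }
        return found;
    }
}
```

A minimum length of 4 matches the default of the strings command and filters out most accidental short runs in binary data.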





[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310716#comment-14310716
 ] 

Hudson commented on TIKA-1539:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #477 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/477/])
TIKA-1539 Fix indent, and move the GRIB and XQuery mime entries to the right 
place in the sorted list (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1658056)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml




[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310814#comment-14310814
 ] 

Chris A. Mattmann commented on TIKA-1539:
-

Thanks Nick you are awesome.



[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310812#comment-14310812
 ] 

Chris A. Mattmann commented on TIKA-1541:
-

Agreed Nick. So are you good with this in terms of code for now, but we remove 
it from the SPI list?



[jira] [Updated] (TIKA-1540) New Tika plugin for image based feature extraction using computer vision techniques

2015-02-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1540:

Description: This will be a web-service client based parser to perform 
image feature extraction using Computer Vision techniques. 

> New Tika plugin for image based feature extraction using computer vision 
> techniques
> ---
>
> Key: TIKA-1540
> URL: https://issues.apache.org/jira/browse/TIKA-1540
> Project: Tika
>  Issue Type: New Feature
> Environment: cross platform
>Reporter: Aashish Chaudhary
>Assignee: Lewis John McGibbney
> Fix For: 1.8
>
>
> This will be a web-service client based parser to perform image feature 
> extraction using Computer Vision techniques. 





[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2015-02-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1456:

Labels: gsoc2015  (was: )

> Visual Sentiment API parser
> ---
>
> Key: TIKA-1456
> URL: https://issues.apache.org/jira/browse/TIKA-1456
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gsoc2015
> Fix For: 1.8
>
>
> Integrate the Visual Sentibank API as a parser for images. We can use 
> Aperture from CMU, it's released under the MIT license:
> https://github.com/d8w/aperture





[jira] [Updated] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1541:

Attachment: TIKA-1541.TotaroMattmann.020615.patch.txt

- updated patch, doesn't include SPI activation

> StringsParser: a simple strings-based parser for Tika
> -
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
> Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.patch


[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310879#comment-14310879
 ] 

Nick Burch commented on TIKA-1541:
--

Looks like we still need some unit tests for this? We would want an {{assume}} 
along the lines of the Tesseract tests, to allow it to skip cleanly if the 
required programs aren't there at test time.

Also, is there a reason why the call to the {{file}} command isn't done the 
same way as the {{strings}} command? One supports a configurable path and has 
checks/handling for whether the program is there or not; the other seems to just 
blindly trust that the program is on the path (which on Windows etc. it won't be).
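
A minimal sketch of the kind of availability check discussed above, assuming nothing about how the patch implements it: ExternalTool is a hypothetical helper that tries to start a command and reports whether it could be launched. Passing a full path instead of a bare program name would cover the configurable-path case.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for checking that an external program can be launched. */
class ExternalTool {

    /** Returns true if the command could be started, false if it was not found. */
    static boolean isAvailable(String... command) {
        Process process;
        try {
            process = new ProcessBuilder(command).redirectErrorStream(true).start();
        } catch (IOException e) {
            return false; // program missing or not executable
        }
        try {
            process.getOutputStream().close(); // the child gets no input
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // do not leave a hung child behind
            }
        } catch (IOException ignored) {
            // The program did start; a failure to close its stdin does not change that.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            process.destroyForcibly();
        }
        return true;
    }
}
```

A JUnit 4 test could then begin with Assume.assumeTrue(ExternalTool.isAvailable("strings", "--version")) so that it is skipped rather than failed on machines without the program. The "--version" argument is an assumption that holds for GNU strings but may not for other implementations.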



[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310881#comment-14310881
 ] 

Chris A. Mattmann commented on TIKA-1541:
-

Thanks [~gagravarr] - [~gostep] - please see Nick's comments - can you send an 
updated patch?



[jira] [Commented] (TIKA-1334) Add presentation layer for results of each run

2015-02-07 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310884#comment-14310884
 ] 

Tyler Palsulich commented on TIKA-1334:
---

This dump actually seems pretty good... Not very pretty, and there's a lot of 
info to digest, but how is it generated?

> Add presentation layer for results of each run
> --
>
> Key: TIKA-1334
> URL: https://issues.apache.org/jira/browse/TIKA-1334
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: static_stats.zip
>
>
> If I'm doing this, it'll probably be vintage mid-90s html.  If someone with 
> some .js kung-fu wants to take this, please do.





[jira] [Commented] (TIKA-1331) Find/configure a vm and gather initial corpus

2015-02-07 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310888#comment-14310888
 ] 

Tyler Palsulich commented on TIKA-1331:
---

So, we would be starting a Docker container on the Rackspace VM? That would 
definitely make deployment easier than the steps on the wiki (even if we put 
them in a script).

[~talli...@apache.org], this issue is fairly tied to TIKA-1302, right? We can't 
run the code from there until we have the VM configured.

I tried starting tika-server on port 80, but requests weren't successful. So, 
I'm thinking iptables isn't set up properly. But... in my couple hours of 
fiddling with it, I'm stumped.

> Find/configure a vm and gather initial corpus
> -
>
> Key: TIKA-1331
> URL: https://issues.apache.org/jira/browse/TIKA-1331
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
>Assignee: Tim Allison
> Attachments: tika-eval-vm-setup.tar.bz2
>
>
> Let's start with govdocs1 for this issue unless there are other easy options. 
>  Going forward, we'll want and need to add a more diverse set of documents.





[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310899#comment-14310899
 ] 

Luis Filipe Nassif commented on TIKA-1541:
--

Hi Chris, I definitely agree Giuseppe's patch is a great start!

But note that in TIKA-1483 I said I have a specific implementation for 
extracting Latin1 scripts encoded with the ISO8859-1, UTF8 and UTF16 charsets at the 
same time (less general than proposed in that issue) and asked if it would be of 
interest. If the community still thinks it would be useful, I will submit a 
patch.

A possible improvement to Giuseppe's patch is to let the user configure the 
encoding parameter of unix strings. It is not hard to write and is a powerful 
configuration option.

I agree with not enabling it by default for octet-stream, just as I suggested 
not enabling TesseractOCRParser by default in the past. Both can add a lot of 
time to parsing and surprise users, as Nick pointed out.
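
To show how small the configurable-encoding change could be, here is a sketch that assembles a command line for GNU strings, whose -e option selects the character size and endianness and whose -n option sets the minimum string length. StringsCommand and its Encoding enum are hypothetical names, not part of the patch, and other strings implementations may not accept -e.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a GNU strings command line. */
class StringsCommand {

    /** Values accepted by the -e option of GNU strings. */
    enum Encoding {
        SEVEN_BIT("s"),
        EIGHT_BIT("S"),
        SIXTEEN_BIT_BE("b"),
        SIXTEEN_BIT_LE("l"),
        THIRTY_TWO_BIT_BE("B"),
        THIRTY_TWO_BIT_LE("L");

        final String flag;

        Encoding(String flag) {
            this.flag = flag;
        }
    }

    /** Builds the argument list; stringsPath may be a bare name or a full path. */
    static List<String> build(String stringsPath, Encoding encoding, int minLength, String file) {
        List<String> command = new ArrayList<>();
        command.add(stringsPath);
        command.add("-e");
        command.add(encoding.flag);
        command.add("-n");
        command.add(Integer.toString(minLength));
        command.add(file);
        return command;
    }
}
```

The resulting list can be handed directly to ProcessBuilder. Running the command once per encoding of interest would be one way to cover several charsets on the same file.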



[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310902#comment-14310902
 ] 

Chris A. Mattmann commented on TIKA-1541:
-

Thanks [~lfcnassif] awesome. It would be great to bring your two patches 
together. This would be awesome. [~gostep] mentioned to me he would be working 
on an update to his patch this afternoon so you may want to reach out to him. 
Cheers to you both.



[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310905#comment-14310905
 ] 

Luis Filipe Nassif commented on TIKA-1541:
--

Another suggestion: I think the parser should not set the contentType to 
octet-stream, so that it can also be used to parse known types that have no specific parser 
and corrupted files that caused their parsers to throw an exception.

> StringsParser: a simple strings-based parser for Tika
> -
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
> Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.patch
>
>
> I implemented an extremely simple {{StringsParser}}, a parser based on the 
> {{strings}} command (or a {{strings}}-alternative command), to use instead of 
> the dummy {{EmptyParser}} for undetected files. It is preliminary work (you 
> can see a lot of TODOs). It is inspired by the work on 
> {{TesseractOCRParser}}. You can find the patch attached.
> I created a GitHub 
> [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
> code. As a first test, you can clone the repo, build the code using the 
> {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
> some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from 
> the "016" subset) detected as {{application/octet-stream}}. The latter script 
> launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting 
> ASCII strings from undetected file types. As far as I understand, many 
> "sophisticated" forensics tools work in a similar manner for indexing 
> purposes: they run some kind of {{strings}} command against files that they 
> are not able to detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}} 
> launches the {{file}} command on undetected files and then writes the output 
> to the {{strings:file_output}} property (I noticed that sometimes the 
> {{file}} command is able to detect the media type for documents not detected 
> by Tika).
> Finally, you can find an old discussion about this topic 
> [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
> Thanks [~chrismattmann].
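
To make the idea in the description concrete, here is a minimal sketch of 
what a strings-style extraction does: it collects runs of printable ASCII 
characters that are at least a minimum length. This is NOT the code from the 
patch (which launches the external {{strings}} command); the class name 
AsciiStringsSketch and its methods are hypothetical, and the rule used for 
"printable" (0x20 to 0x7e plus tab) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or of the patch.
final class AsciiStringsSketch {

    /** Returns every run of at least minLen printable ASCII characters in data. */
    static List<String> extract(byte[] data, int minLen) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xff; // treat the byte as unsigned
            if ((c >= 0x20 && c <= 0x7e) || c == '\t') {
                run.append((char) c);      // still inside a printable run
            } else {
                flush(run, minLen, found); // a non-printable byte ends the run
            }
        }
        flush(run, minLen, found);         // the data may end inside a run
        return found;
    }

    // Keeps the current run if it is long enough, then resets the buffer.
    private static void flush(StringBuilder run, int minLen, List<String> found) {
        if (run.length() >= minLen) {
            found.add(run.toString());
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 'T', 'i', 'k', 'a', 0x01, 'o', 'k', 0x02,
                       'p', 'a', 'r', 's', 'e', 'r'};
        System.out.println(extract(data, 4)); // prints [Tika, parser]
    }
}
```

The sketch works on an in-memory byte array to stay short; a real parser 
would read from the input stream instead of loading the whole file.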





[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311065#comment-14311065
 ] 

Giuseppe Totaro commented on TIKA-1541:
---

Hi all,
you can find the updated patch and a test file attached. 

[~gagravarr] I agree not to enable {{StringsParser}} by default for 
{{application/octet-stream}}. However, I think it may be a good solution for 
digital investigation purposes, as done by many forensic tools (e.g., FTK) that 
run {{strings}} (or a similar program) over undetected files in order to 
extract text for building an inverted index. 
I added a configuration class for the {{file}} command ({{FileConfig}}). I also 
added "essential" unit tests for {{StringsParser}}, {{StringsConfig}}, and 
{{FileConfig}}.

[~lfcnassif] I had thought of a more general configuration class for 
strings-like commands (using the {{setOptions}} method). 
Since we are focusing on the {{strings}} command, I updated the 
{{StringsConfig}} class to support the {{min-len}} and {{encoding}} options. 
The {{StringsEncoding}} enumeration contains all values allowed for 
{{strings}}. As far as I understand, the {{-e}} option for encoding is not yet 
supported by the Windows version. The parser no longer sets the Content-Type 
to {{octet-stream}}.

Thanks [~chrismattmann], [~gagravarr] and [~lfcnassif] for supporting me.
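
As a rough illustration of the configuration described above, the sketch 
below maps a minimum length and an encoding to a {{strings}} command line. 
It is not the {{StringsConfig}} or {{StringsEncoding}} code from the patch: 
the class, enum, and method names are hypothetical, and the 
supportsEncodingFlag parameter is an assumed way of handling platforms where 
{{-e}} is unavailable. The {{-n}} and {{-e}} options and the letters 
s, S, b, l, B, L are those of GNU {{strings}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; the classes in the patch may look different.
final class StringsCommandSketch {

    /** Character encodings accepted by the -e option of GNU strings. */
    enum Encoding {
        SINGLE_7_BIT("s"),
        SINGLE_8_BIT("S"),
        BIGENDIAN_16_BIT("b"),
        LITTLEENDIAN_16_BIT("l"),
        BIGENDIAN_32_BIT("B"),
        LITTLEENDIAN_32_BIT("L");

        final String flag;

        Encoding(String flag) {
            this.flag = flag;
        }
    }

    /**
     * Builds the argument list for running strings on one file.
     * The -e option is added only when supportsEncodingFlag is true.
     */
    static List<String> buildCommand(String file, int minLen, Encoding encoding,
                                     boolean supportsEncodingFlag) {
        if (minLen < 1) {
            throw new IllegalArgumentException("min-len must be at least 1");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("strings");
        cmd.add("-n");
        cmd.add(Integer.toString(minLen));
        if (supportsEncodingFlag) {
            cmd.add("-e");
            cmd.add(encoding.flag);
        }
        cmd.add(file);
        return cmd;
    }

    public static void main(String[] args) {
        // The returned list can be handed to java.lang.ProcessBuilder.
        System.out.println(buildCommand("sample.bin", 4, Encoding.SINGLE_7_BIT, true));
    }
}
```

Keeping the command as a list of separate arguments, rather than one 
string, avoids quoting problems with file names that contain spaces.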

> StringsParser: a simple strings-based parser for Tika
> -
>
> Key: TIKA-1541
> URL: https://issues.apache.org/jira/browse/TIKA-1541
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmann.020615.patch.txt, 
> TIKA-1541.TotaroMattmannBurchNassif.020715.patch, TIKA-1541.patch, 
> testOCTET_header.dbase3





[jira] [Updated] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1541:
--
Attachment: TIKA-1541.TotaroMattmannBurchNassif.020715.patch
testOCTET_header.dbase3






[jira] [Comment Edited] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311065#comment-14311065
 ] 

Giuseppe Totaro edited comment on TIKA-1541 at 2/8/15 2:24 AM:
---

Hi all,
you can find the updated patch and a test file attached. 

[~gagravarr] I agree not to enable {{StringsParser}} by default for 
{{application/octet-stream}}. However, I think it may be a good solution for 
digital investigation purposes, as done by many forensic tools (e.g., FTK) that 
run {{strings}} (or a similar program) over undetected files in order to 
extract text for building an inverted index. 
I added a configuration class for the {{file}} command ({{FileConfig}}). I also 
added "essential" unit tests for {{StringsParser}}, {{StringsConfig}}, and 
{{FileConfig}}.

[~lfcnassif] I had thought of a more general configuration class for 
strings-like commands (using the {{setOptions}} method). 
Since we are focusing on the {{strings}} command, I updated the 
{{StringsConfig}} class to support the {{min-len}} and {{encoding}} options. 
The {{StringsEncoding}} enumeration contains all values allowed for 
{{strings}}. As far as I understand, the {{-e}} option for encoding is not yet 
supported by the Windows version. The parser no longer sets the Content-Type 
to {{octet-stream}}.

Thanks [~chrismattmann], [~gagravarr] and [~lfcnassif] for supporting me.






--
This message was sent by Atlassian JIRA
(