[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177058#comment-14177058
 ] 

William Palmer commented on TIKA-1302:
--


I have left the British Library (as of 20th October 2014).  Please contact 
maureen.penn...@bl.uk if you need to contact someone.

Any FOI requests should be sent to foi-enquir...@bl.uk.


**
Experience the British Library online at www.bl.uk
The British Library’s latest Annual Report and Accounts: 
www.bl.uk/aboutus/annrep/index.html
Help the British Library conserve the world's knowledge. Adopt a Book. 
www.bl.uk/adoptabook
The Library's St Pancras site is WiFi - enabled
*
The information contained in this e-mail is confidential and may be legally 
privileged. It is intended for the addressee(s) only. If you are not the 
intended recipient, please delete this e-mail and notify the 
postmas...@bl.uk : The contents of this e-mail must 
not be disclosed or copied without the sender's consent.
The statements and opinions expressed in this message are those of the author 
and do not necessarily reflect those of the British Library. The British 
Library does not take any responsibility for the views of the author.
*
Think before you print


 Let's run Tika against a large batch of docs nightly
 

 Key: TIKA-1302
 URL: https://issues.apache.org/jira/browse/TIKA-1302
 Project: Tika
  Issue Type: Improvement
  Components: cli, general, server
Reporter: Tim Allison

 Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
 running again, it might be fun to run Tika regularly against a large set of 
 docs and report metrics.
 One excellent candidate corpus is govdocs1: 
 http://digitalcorpora.org/corpora/files.
 Any other candidate corpora?  
 [~willp-bl], have anything handy you'd like to contribute? 
 [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-06-23 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040623#comment-14040623
 ] 

William Palmer commented on TIKA-1232:
--

I am currently out of the office and will be back on Thursday 26th June 2014.

Any FOI requests should be sent to foi-enquir...@bl.uk.




 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, 
 testComment.pdf


 I'd like to identify the PDF version of files. This is not currently reported 
 by the PDFParser, although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-19 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001532#comment-14001532
 ] 

William Palmer commented on TIKA-1302:
--

This one might be worth a look - https://github.com/openplanets/format-corpus - 
Some of the files there are (intentionally) broken, and some are there as 
examples of format features (e.g. PDF with password, embedded fonts, etc.).  If 
the license is not clear enough for any files then please raise an issue; I'm 
sure people will be glad to help.

Unfortunately I can't share any of the web content I describe using in that 
blog post.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-11 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930114#comment-13930114
 ] 

William Palmer commented on TIKA-1232:
--

Thanks Tim & everyone

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-04 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919863#comment-13919863
 ] 

William Palmer commented on TIKA-1232:
--

I am currently out of the office and will be back on Monday 11th March 2014.

Any FOI requests should be sent to foi-enquir...@bl.uk.





 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-21 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908291#comment-13908291
 ] 

William Palmer commented on TIKA-1232:
--

Hi Tim & Andy,

Thanks - your code works on my test files.  One question though: it appears 
that dc:format should be a MIME type.  Should the Extended-Content-Type 
dc:format therefore be an actual MIME type with a version parameter, like 
application/pdf; version=A-1a, with A-1a overriding the pdf:PDFVersion of 1.4?

Thanks for this - much appreciated!
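
To make the question above concrete, here is a minimal sketch of what a versioned media type string could look like if the PDF/A level were allowed to override the plain PDF version. The class and method names (VersionedMediaType, effectiveVersion, withVersion, versionOf) are hypothetical and are not part of Tika; the sketch uses plain string handling only and assumes Java 8 or later.

```java
/** Hypothetical helper illustrating a media type with a version parameter. */
final class VersionedMediaType {

    /** Prefers the PDF/A level (e.g. "A-1a") over the plain PDF version (e.g. "1.4"). */
    static String effectiveVersion(String pdfVersion, String pdfaLevel) {
        return pdfaLevel != null ? pdfaLevel : pdfVersion;
    }

    /** Builds a string such as "application/pdf; version=1.4". */
    static String withVersion(String baseType, String version) {
        return baseType + "; version=" + version;
    }

    /** Extracts the "version" parameter from a media type string, or null if absent. */
    static String versionOf(String mediaType) {
        String[] parts = mediaType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] keyValue = parts[i].trim().split("=", 2);
            if (keyValue.length == 2 && keyValue[0].trim().equalsIgnoreCase("version")) {
                return keyValue[1].trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String type = withVersion("application/pdf", effectiveVersion("1.4", "A-1a"));
        System.out.println(type);
        System.out.println(versionOf(type));
    }
}
```

Whether the override is desirable is exactly the open question in this comment; the sketch only shows that both values can coexist if the plain version stays in pdf:PDFVersion.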

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: TIKA-1232v1.patch, pdfversion.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-456) Support timeouts for parsers

2014-02-13 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900333#comment-13900333
 ] 

William Palmer commented on TIKA-456:
-

I've written a proof of concept TimeoutParser here: 
https://github.com/willp-bl/nanite/blob/master/nanite-hadoop/src/main/java/uk/bl/wap/hadoop/profiler/TimeoutParser.java

This times out correctly when using Tika 1.4 and the corrupt mp3 from TIKA-1179.



 Support timeouts for parsers
 

 Key: TIKA-456
 URL: https://issues.apache.org/jira/browse/TIKA-456
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Ken Krugler
Assignee: Chris A. Mattmann

 There are a number of reasons why Tika could hang while parsing. One common 
 case is when a parser is fed an incomplete document, such as what happens 
 when limiting the amount of data fetched during a web crawl.
 One solution is to create a TikaCallable that wraps the Tika parser, and 
 then use this with a FutureTask. For example, when using a ParsedDatum POJO 
 for the results of the parse operation, I do something like this:
     parser = new AutoDetectParser();
     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, 
         inputstream, metadata);
     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
     Thread t = new Thread(task);
     t.start();
     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
 And TikaCallable() looks like:
     class TikaCallable implements Callable<ParsedDatum> {
         public TikaCallable(Parser parser, ContentHandler handler, InputStream 
                 is, Metadata metadata) {
             _parser = parser;
             _handler = handler;
             _input = is;
             _metadata = metadata;
             ...
         }
         public ParsedDatum call() throws Exception {
 
             _parser.parse(_input, _handler, _metadata, new ParseContext());
 
         }
     }
 This seems like it would be generally useful, as I doubt that we'd ever be 
 able to guarantee that none of the parsers being wrapped by Tika could ever 
 hang.
 One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. 
 something like:
     Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
 Then the call to p.parse(...) would create a Callable (similar to the code 
 above) and use the specified timeout when calling task.get().
 One minus with this approach is that it creates a new thread for each parse 
 request, but I don't think the thread overhead is significant when compared 
 to the typical parser operation.
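
The timeout pattern described above can be sketched without any Tika dependency. This is an illustration under assumptions, not the TimeoutParser proposed in the issue: TimeoutRunner and callWithTimeout are hypothetical names, the Callable stands in for a parse call, and the code assumes Java 8 or later (the issue itself predates lambdas).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs any Callable with an upper bound on wall-clock time. */
final class TimeoutRunner {

    /**
     * Runs the task on a fresh daemon thread and waits at most the given time.
     * Throws TimeoutException if the task has not finished by then.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "timeout-runner");
            thread.setDaemon(true); // a hung task must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; a task that ignores interrupts keeps running
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that completes quickly.
        System.out.println(callWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS));

        // Stand-in for a parse that hangs.
        try {
            callWithTimeout(() -> { Thread.sleep(10_000); return "never"; },
                    100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

As in the description, each call creates a new thread. The caller gets control back on timeout, but the worker thread is only interrupted, so a parser stuck in a loop that never checks its interrupt status would continue to consume CPU in the background.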



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-456) Support timeouts for parsers

2014-02-13 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900360#comment-13900360
 ] 

William Palmer commented on TIKA-456:
-

AFAIK it's still deprecated.  If a library used by Tika hangs, or Tika itself 
hangs, that causes my code to hang and this is less desirable than using 
Thread.stop().  If there is a non-deprecated way to stop a thread I will gladly 
use that approach.  

I am using Tika with Hadoop to parse files in web archives (93 million files) 
and a hang causes the map to fail which then causes the job to fail.
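
The limitation behind this comment can be shown with a small sketch: Thread.interrupt(), the non-deprecated mechanism, only sets a flag, so it stops a task only if the task checks for it. InterruptDemo and stopsOnInterrupt are hypothetical names, the busy loop stands in for a hung parser, and the code assumes Java 8 or later.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical demo: interruption is cooperative, unlike the deprecated Thread.stop(). */
final class InterruptDemo {

    /**
     * Starts a busy loop, interrupts it, and reports whether the thread
     * had stopped within 500 ms of the interrupt.
     */
    static boolean stopsOnInterrupt(boolean cooperative) throws InterruptedException {
        AtomicBoolean release = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            while (!release.get()) {
                // Only a cooperative task looks at its interrupt status.
                if (cooperative && Thread.currentThread().isInterrupted()) {
                    return;
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
        worker.interrupt();
        worker.join(500);
        boolean stopped = !worker.isAlive();
        release.set(true); // end the loop so the demo does not leave a spinning thread
        worker.join();
        return stopped;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cooperative task stopped: " + stopsOnInterrupt(true));
        System.out.println("non-cooperative task stopped: " + stopsOnInterrupt(false));
    }
}
```

A parser or third-party library that hangs in a loop like the non-cooperative case cannot be stopped from inside the same JVM by interruption alone, which is the trade-off described above.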




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread William Palmer (JIRA)
William Palmer created TIKA-1232:


 Summary: Add PDF version to PDFParser output
 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Priority: Minor
 Attachments: pdfversion.patch

I'd like to identify the PDF version of files. This is not currently reported 
by the PDFParser, although the information is available via PDFBox.  I have 
attached a patch that adds the format version to the Metadata object.

However, I am not familiar enough with the Tika source to know if an 
alternative metadata key should be used, or this new one added.

Comments welcome.
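
For readers unfamiliar with where the version comes from, the following is a dependency-free sketch that reads the version from the "%PDF-x.y" marker at the start of a file. It is not the attached patch and does not use PDFBox; PdfHeaderVersion and headerVersion are hypothetical names. Note that the header is only one source: from PDF 1.4 onwards the document catalog may carry a /Version entry that overrides it, which this sketch does not inspect.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the version from a leading "%PDF-x.y" marker. */
final class PdfHeaderVersion {

    /** Returns the version string (e.g. "1.4"), or null if the marker is missing. */
    static String headerVersion(byte[] data) {
        // The marker is plain ASCII, so a single-byte charset decodes it safely.
        String head = new String(data, 0, Math.min(data.length, 16),
                StandardCharsets.ISO_8859_1);
        if (!head.startsWith("%PDF-")) {
            return null;
        }
        int end = 5;
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return end > 5 ? head.substring(5, end) : null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n%rest of file".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headerVersion(sample));
    }
}
```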



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)