[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177058#comment-14177058 ]

William Palmer commented on TIKA-1302:
--------------------------------------

I have left the British Library (as of 20th October 2014). Please contact maureen.penn...@bl.uk if you need to contact someone. Any FOI requests should be sent to foi-enquir...@bl.uk.

Experience the British Library online at www.bl.uk
The British Library’s latest Annual Report and Accounts: www.bl.uk/aboutus/annrep/index.html
Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook
The Library's St Pancras site is WiFi-enabled

The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the postmas...@bl.uk: The contents of this e-mail must not be disclosed or copied without the sender's consent. The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.

Think before you print

Let's run Tika against a large batch of docs nightly
----------------------------------------------------

Key: TIKA-1302
URL: https://issues.apache.org/jira/browse/TIKA-1302
Project: Tika
Issue Type: Improvement
Components: cli, general, server
Reporter: Tim Allison

Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040623#comment-14040623 ]

William Palmer commented on TIKA-1232:
--------------------------------------

I am currently out of the office and will be back on Thursday 26th June 2014. Any FOI requests should be sent to foi-enquir...@bl.uk.

Add PDF version to PDFParser output
-----------------------------------

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, testComment.pdf

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001532#comment-14001532 ]

William Palmer commented on TIKA-1302:
--------------------------------------

This one might be worth a look: https://github.com/openplanets/format-corpus

Some of the files there are (intentionally) broken, and some are there as examples of format features (e.g. PDFs with passwords, embedded fonts, etc.).

If the license is not clear enough for any files then please raise an issue; I'm sure people will be glad to help.

Unfortunately I can't share any of the web content I describe using in that blog post.

Let's run Tika against a large batch of docs nightly
----------------------------------------------------

Key: TIKA-1302
URL: https://issues.apache.org/jira/browse/TIKA-1302
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison

Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;)

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930114#comment-13930114 ]

William Palmer commented on TIKA-1232:
--------------------------------------

Thanks Tim and everyone.

Add PDF version to PDFParser output
-----------------------------------

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919863#comment-13919863 ]

William Palmer commented on TIKA-1232:
--------------------------------------

I am currently out of the office and will be back on Monday 11th March 2014. Any FOI requests should be sent to foi-enquir...@bl.uk.

Add PDF version to PDFParser output
-----------------------------------

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908291#comment-13908291 ]

William Palmer commented on TIKA-1232:
--------------------------------------

Hi Tim and Andy,

Thanks - your code works on my test files.

One question though: it appears that dc:format should be a MIME type. Should the Extended-Content-Type dc:format therefore be an actual MIME type with a version, like application/pdf; version=A-1a, with A-1a overriding the pdf:PDFVersion of 1.4?

Thanks for this - much appreciated!

Add PDF version to PDFParser output
-----------------------------------

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: TIKA-1232v1.patch, pdfversion.patch

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
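To make the dc:format question above concrete, here is a minimal sketch of composing and reading a media type string that carries a version parameter. The class and method names (VersionedMediaType, withVersion, versionOf, preferredVersion) are hypothetical and are not Tika API. The "PDF/A level overrides the plain PDF version" rule is only the behaviour the comment asks about, not something Tika is confirmed to do.

```java
/** Hypothetical helper, not part of Tika: media type strings with a version parameter. */
final class VersionedMediaType {
    private VersionedMediaType() {}

    /** Builds e.g. "application/pdf; version=1.4". */
    static String withVersion(String baseType, String version) {
        return baseType + "; version=" + version;
    }

    /** Returns the value of the "version" parameter, or null if there is none. */
    static String versionOf(String mediaType) {
        String[] parts = mediaType.split(";");
        // parts[0] is the base type; the remaining parts are name=value parameters.
        for (int i = 1; i < parts.length; i++) {
            String[] nameValue = parts[i].trim().split("=", 2);
            if (nameValue.length == 2 && nameValue[0].trim().equalsIgnoreCase("version")) {
                return nameValue[1].trim();
            }
        }
        return null;
    }

    /** Assumed rule from the question: a PDF/A level, when present, wins over the PDF version. */
    static String preferredVersion(String pdfVersion, String pdfaLevel) {
        return pdfaLevel != null ? pdfaLevel : pdfVersion;
    }
}
```

For a PDF/A-1a file with header version 1.4, withVersion("application/pdf", preferredVersion("1.4", "A-1a")) yields the string from the question, while a plain PDF keeps its numeric version.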
[jira] [Commented] (TIKA-456) Support timeouts for parsers
[ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900333#comment-13900333 ]

William Palmer commented on TIKA-456:
-------------------------------------

I've written a proof of concept TimeoutParser here: https://github.com/willp-bl/nanite/blob/master/nanite-hadoop/src/main/java/uk/bl/wap/hadoop/profiler/TimeoutParser.java

This times out correctly when using Tika 1.4 and the corrupt mp3 from TIKA-1179.

Support timeouts for parsers
----------------------------

Key: TIKA-456
URL: https://issues.apache.org/jira/browse/TIKA-456
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ken Krugler
Assignee: Chris A. Mattmann

There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, such as what happens when limiting the amount of data fetched during a web crawl.

One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:

    parser = new AutoDetectParser();
    Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
    FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
    Thread t = new Thread(task);
    t.start();
    ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);

And TikaCallable() looks like:

    class TikaCallable implements Callable<ParsedDatum> {
        public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
            _parser = parser;
            _handler = handler;
            _input = is;
            _metadata = metadata;
            ...
        }

        public ParsedDatum call() throws Exception {
            _parser.parse(_input, _handler, _metadata, new ParseContext());
        }
    }

This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.

One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

    Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().

One minus with this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant when compared to the typical parser operation.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
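The Callable-plus-timeout idea in the description can be sketched without any Tika dependency. The following is a minimal illustration, assuming Java 8 or later, and is not the TimeoutParser linked above. TimeoutRunner and callWithTimeout are hypothetical names, and the Callable stands in for a call to parser.parse(...).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Tika: runs any Callable with a deadline. */
final class TimeoutRunner {
    private TimeoutRunner() {}

    static <T> T callWithTimeout(Callable<T> work, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        // One daemon thread per call, so a task that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-runner");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<T> future = executor.submit(work);
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                // Requests an interrupt; only work that responds to interruption will stop.
                future.cancel(true);
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

As in the description, this spends one thread per call. The caller gets control back at the deadline, but the worker thread itself only ends if the wrapped code reacts to the interrupt.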
[jira] [Commented] (TIKA-456) Support timeouts for parsers
[ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900360#comment-13900360 ]

William Palmer commented on TIKA-456:
-------------------------------------

AFAIK it's still deprecated. If a library used by Tika hangs, or Tika itself hangs, that causes my code to hang, and this is less desirable than using Thread.stop(). If there is a non-deprecated way to stop a thread I will gladly use that approach.

I am using Tika with Hadoop to parse files in web archives (93 million files), and a hang causes the map to fail, which then causes the job to fail.

Support timeouts for parsers
----------------------------

Key: TIKA-456
URL: https://issues.apache.org/jira/browse/TIKA-456
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Ken Krugler
Assignee: Chris A. Mattmann

There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, such as what happens when limiting the amount of data fetched during a web crawl.

One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:

    parser = new AutoDetectParser();
    Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
    FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
    Thread t = new Thread(task);
    t.start();
    ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);

And TikaCallable() looks like:

    class TikaCallable implements Callable<ParsedDatum> {
        public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
            _parser = parser;
            _handler = handler;
            _input = is;
            _metadata = metadata;
            ...
        }

        public ParsedDatum call() throws Exception {
            _parser.parse(_input, _handler, _metadata, new ParseContext());
        }
    }

This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.

One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

    Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().

One minus with this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant when compared to the typical parser operation.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
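The concern about stopping a hung thread comes down to Thread.interrupt() being cooperative: it sets a flag and wakes blocking calls, but code that never checks the flag keeps running. The sketch below is only an illustration of that standard Java behaviour. InterruptDemo and survivesInterrupt are hypothetical names, and the Runnable stands in for a parser that may or may not respond to interruption.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo, not part of Tika: does a task outlive an interrupt request? */
final class InterruptDemo {
    private InterruptDemo() {}

    /** Returns true if the worker is still alive graceMillis after being interrupted. */
    static boolean survivesInterrupt(Runnable task, long graceMillis) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            started.countDown();   // signal that the worker is running
            task.run();
        }, "worker");
        worker.setDaemon(true);    // a stuck worker must not block JVM exit
        worker.start();
        started.await();           // interrupt only once the worker has started
        worker.interrupt();
        worker.join(graceMillis);  // give a cooperative task time to finish
        return worker.isAlive();
    }
}
```

A loop that polls Thread.currentThread().isInterrupted() returns false here, while a loop that ignores interruption returns true. For code of the second kind, the options that remain are the deprecated Thread.stop() or isolating the work in a separate process that can be killed.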
[jira] [Created] (TIKA-1232) Add PDF version to PDFParser output
William Palmer created TIKA-1232:
---------------------------------

Summary: Add PDF version to PDFParser output
Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Priority: Minor
Attachments: pdfversion.patch

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
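For readers unfamiliar with where a PDF's version comes from, here is a minimal standard-library sketch that reads it from the "%PDF-x.y" file header. It is not the attached patch, which takes the value from PDFBox. PdfHeaderVersion and its read method are hypothetical names, and the 1024-byte search window is an assumption based on common reader behaviour. The document catalog of a PDF can also declare a later version than the header, which this sketch does not look at.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika or PDFBox: reads the version from a PDF header. */
final class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    private PdfHeaderVersion() {}

    /** Returns e.g. "1.4", or null if no header is found in the first 1024 bytes. */
    static String read(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        int total = 0;
        int n;
        // Fill the buffer, since a single read() may return fewer bytes than requested.
        while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

This only reports the header value; a PDF/A conformance level such as the A-1b sample attached to the issue lives in the document's XMP metadata and needs a real PDF library to extract.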