[jira] [Created] (TIKA-1309) RTF TextExtractor can ignore consecutive linebreaks

2014-05-23 Thread Aleksandr Dubinsky (JIRA)
Aleksandr Dubinsky created TIKA-1309:


 Summary: RTF TextExtractor can ignore consecutive linebreaks
 Key: TIKA-1309
 URL: https://issues.apache.org/jira/browse/TIKA-1309
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
Reporter: Aleksandr Dubinsky


Some RTF files encode consecutive line breaks simply as consecutive \par 
commands. However, org.apache.tika.parser.rtf.TextExtractor ignores the 
second \par.

The solution is to replace, at line 1158:

} else if (equals("par")) {
    if (!ignored) {
        endParagraph(true);
    }
}

with:


} else if (equals("par")) {
    if (!ignored) {
        lazyStartParagraph();
        endParagraph(true);
    }
}
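As a rough illustration (this is a hypothetical mini-model, not the real 
TextExtractor, which tracks far more state), the sketch below shows why the 
second \par is dropped unless a paragraph is lazily restarted before it is 
ended:

```java
import java.util.List;

// Hypothetical model of the paragraph state machine in RTF extraction.
// "par" tokens end a paragraph; any other token is text. With lazyStart=false,
// a "par" seen while no paragraph is open is silently dropped (the bug);
// with lazyStart=true, each "par" first opens a paragraph if needed (the fix).
public class ParModel {
    static String render(List<String> tokens, boolean lazyStart) {
        StringBuilder out = new StringBuilder();
        boolean inParagraph = false;
        for (String t : tokens) {
            if (t.equals("par")) {
                if (lazyStart && !inParagraph) {
                    out.append("<p>");
                    inParagraph = true;
                }
                if (inParagraph) {
                    out.append("</p>");
                    inParagraph = false;
                }
            } else {
                if (!inParagraph) {
                    out.append("<p>");
                    inParagraph = true;
                }
                out.append(t);
            }
        }
        if (inParagraph) {
            out.append("</p>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> doc = List.of("a", "par", "par", "b");
        System.out.println(render(doc, false)); // buggy: second par vanishes
        System.out.println(render(doc, true));  // fixed: empty paragraph kept
    }
}
```

With lazyStart=false the two \par tokens collapse into a single paragraph 
break, which is exactly the reported symptom.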



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007717#comment-14007717
 ] 

Tim Allison edited comment on TIKA-1294 at 5/23/14 9:19 PM:


I investigated a bit more and sent a question to the pdfbox users list.  It 
looks like the memory consumption profile is far better in PDFBox 2.0 (a 
constant ~130 MB), but I was getting errors when I tried to view the exported 
files.  With PDFBox 2.0, I found that govdocs 239665 (mentioned above as a JVM 
killer) had 2,750 embedded images (2.6 GB) when they were fully extracted.

Given the OOM issues with PDFBox 1.8.5 on some files, I'd prefer to set the 
default behavior to not extract PDXObjectImages.  Given that I found this 
problem in my small personal test set and in the first 500 govdocs test, it 
may be a fairly common issue.

Users who just want text and/or metadata will face a decent-sized increase in 
OOM exceptions if we leave this on by default. [~jukkaz], I won't turn off the 
feature you added, though, without your consent! 

I'd also prefer to allow users to turn this on/off via a config file so that 
non-dev folks who are using Tika don't have to add their own DocumentSelector.

Patch is attached. I've added a parameter in PDFParserConfig _and_ I've added 
some metadata that will allow consumers who want to use a DocumentSelector to 
tell what type of embedded object they're looking at.

Any and all feedback is welcome.  I'm not held to the decisions I made in this 
patch.
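For consumers who go the DocumentSelector route, the filtering boils down to a 
predicate over the embedded resource's metadata. A minimal sketch, using a 
plain Map in place of Tika's Metadata class, and a hypothetical metadata 
key/value (the actual key the patch introduces isn't shown in this thread); a 
real implementation would implement org.apache.tika.extractor.DocumentSelector 
and register it on the ParseContext:

```java
import java.util.Map;

// Sketch of the selection logic a DocumentSelector could apply.
// EMBEDDED_TYPE_KEY and INLINE_IMAGE are hypothetical placeholders for
// whatever metadata the TIKA-1294 patch actually records.
public class EmbeddedImageFilter {
    static final String EMBEDDED_TYPE_KEY = "embeddedResourceType"; // hypothetical
    static final String INLINE_IMAGE = "INLINE_IMAGE";              // hypothetical

    /** Returns false for embedded PDXObjectImages so they are skipped. */
    static boolean select(Map<String, String> metadata) {
        return !INLINE_IMAGE.equals(metadata.get(EMBEDDED_TYPE_KEY));
    }

    public static void main(String[] args) {
        System.out.println(select(Map.of(EMBEDDED_TYPE_KEY, INLINE_IMAGE)));
        System.out.println(select(Map.of("Content-Type", "application/pdf")));
    }
}
```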

 


was (Author: talli...@mitre.org):
I investigated a bit more and sent a question to pdfbox users list.  It looks 
like the memory consumption profile is far better in PDFBox 2.0 (constant 
130m), but I was getting errors when I tried to view the exported files.  With 
PDFBox 2.0, I found that govdocs 239665 (mentioned above as jvm killer) had 
2,750 embedded images (2.6 GB) when they were fully extracted.

Given the OOM issues with PDFBox 1.8.5 on some files, I'd prefer to set the 
default behavior to not extract PDXObjectImages.  I figure if I found this 
problem in my small personal test set and in the first 500 govdocs test, this 
may be a fairly common issue.

Users who just want text and/or metadata will face a decent sized increase in 
OOM Exceptions if we leave this on as default. [~jukkaz], I won't want to turn 
off the feature you added, though, without your consent! 

I'd also prefer to allow users to turn this on/off via config file so that 
non-dev folks who are using Tika don't have to add their own DocumentSelector.

Patch is attached. I've added a parameter in PDFParserConfig _and_ I've added 
some metadata that will allow consumers who want to use a DocumentSelector to 
tell what type of embedded object they're looking at.

Any and all feedback is welcome.  I'm not held to the decisions I made in this 
patch.

 

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.





[jira] [Updated] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-23 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1294:
--

Attachment: TIKA-1294v1.patch

I investigated a bit more and sent a question to the pdfbox users list.  It 
looks like the memory consumption profile is far better in PDFBox 2.0 (a 
constant ~130 MB), but I was getting errors when I tried to view the exported 
files.  With PDFBox 2.0, I found that govdocs 239665 (mentioned above as a JVM 
killer) had 2,750 embedded images (2.6 GB) when they were fully extracted.

Given the OOM issues with PDFBox 1.8.5 on some files, I'd prefer to set the 
default behavior to not extract PDXObjectImages.  Given that I found this 
problem in my small personal test set and in the first 500 govdocs test, it 
may be a fairly common issue.

Users who just want text and/or metadata will face a decent-sized increase in 
OOM exceptions if we leave this on by default. [~jukkaz], I won't turn off the 
feature you added, though, without your consent! 

I'd also prefer to allow users to turn this on/off via a config file so that 
non-dev folks who are using Tika don't have to add their own DocumentSelector.

Patch is attached. I've added a parameter in PDFParserConfig _and_ I've added 
some metadata that will allow consumers who want to use a DocumentSelector to 
tell what type of embedded object they're looking at.

Any and all feedback is welcome.  I'm not held to the decisions I made in this 
patch.

 

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.





[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007348#comment-14007348
 ] 

Tim Allison edited comment on TIKA-1302 at 5/23/14 4:52 PM:


Yes, that's an important question.  It all depends on the size of the corpus 
and what we want for processing time.

Let's assume we start with govdocs1 or a sample of it.

A completely back-of-the-envelope estimate:

On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40 
seconds to index 1000 files from govdocs1 (let's assume the time to index is 
roughly equivalent to the time it'll take to write out the diagnostic stuff 
we'll want to record for each file).

That would be 10k files in 6.6 minutes, 100k files in a bit more than an hour, 
and 1M files in 11 hours.

So, if we wanted to start small, we could start with 100k.  The full govdocs1 
takes up 470GB.  A 100k sample would take up roughly 47GB.

We'd want probably (ballpark) 10x input corpus size to store the output so that 
we can compare different versions of Tika.  So, 0.5 TB.  Let's double that for 
some growth: 1 TB.

So, with a modest 4 cores, let's say 4 GB RAM, and 1 TB of storage, we could 
run Tika against 100k files in a bit more than an hour.  Add another few 
minutes to compare output for comparison statistics.
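The arithmetic behind those estimates, as a quick sanity check (assuming the 
measured 40 s per 1000 files scales linearly, which a real run may not):

```java
// Back-of-the-envelope throughput projection from the measured rate:
// ~40 seconds to index 1000 govdocs1 files on a 4-core laptop.
public class CorpusEstimate {
    static final double SEC_PER_FILE = 40.0 / 1000.0;

    static double minutesFor(long files) {
        return files * SEC_PER_FILE / 60.0;
    }

    public static void main(String[] args) {
        System.out.printf("10k files:  %.1f minutes%n", minutesFor(10_000));
        System.out.printf("100k files: %.1f hours%n", minutesFor(100_000) / 60.0);
        System.out.printf("1M files:   %.1f hours%n", minutesFor(1_000_000) / 60.0);
    }
}
```

This reproduces the figures above: ~6.7 minutes for 10k, ~1.1 hours for 100k, 
~11.1 hours for 1M.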

***These numbers are based on a purely in-memory run.  We'll probably want to 
run against a server (not the public one, of course) so that'll add some to the 
time.

Do these numbers jibe with what others are experiencing?

The big gotcha, of course, is that we'll want to harden the server and/or 
create a server daemon to restart the server(s) for OOM and infinite hangs.  
But I think those features are badly needed and this project will give good 
motivation for these improvements.




was (Author: talli...@mitre.org):
Y, that's an important question.  All depends on size of corpus and what we 
want for processing time.

Let's assume we start with govdocs1 or a sample of it.

Complete back of envelope...

On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40 
seconds to index 1000 files from govdocs1 (let's assume the time to index is 
roughly equivalent to the time it'll take to write out the diagnostic stuff 
we'll want to record for each file).

That would be 10k files in 6.6 minutes, 100k files in a bit more than an hour 
and 1M files in 11 hours.

So, if wanted to start small, we could start with 100k.  The full govdocs1 
takes up 470GB.  A 100k sample would take up roughly 47GB.

We'd want probably (ballpark) 10x input corpus size to store the output so that 
we can compare different versions of Tika.  So, 0.5 TB.  Let's double that for 
some growth: 1 TB.

So, with a modest 4 cores and 1 TB of storage, we could run Tika against 100k 
files in a bit more than an hour.  Add another few minutes to compare output 
for comparison statistics.

***These numbers are based on a purely in-memory run.  We'll probably want to 
run against a server (not the public one, of course) so that'll add some to the 
time.

Do these numbers jibe with what others are experiencing?

The big gotcha, of course, is that we'll want to harden the server and/or 
create a server daemon to restart the server(s) for OOM and infinite hangs.  
But I think those features are badly needed and this project will give good 
motivation for these improvements.



> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007348#comment-14007348
 ] 

Tim Allison commented on TIKA-1302:
---

Yes, that's an important question.  It all depends on the size of the corpus 
and what we want for processing time.

Let's assume we start with govdocs1 or a sample of it.

A completely back-of-the-envelope estimate:

On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40 
seconds to index 1000 files from govdocs1 (let's assume the time to index is 
roughly equivalent to the time it'll take to write out the diagnostic stuff 
we'll want to record for each file).

That would be 10k files in 6.6 minutes, 100k files in a bit more than an hour, 
and 1M files in 11 hours.

So, if we wanted to start small, we could start with 100k.  The full govdocs1 
takes up 470GB.  A 100k sample would take up roughly 47GB.

We'd want probably (ballpark) 10x input corpus size to store the output so that 
we can compare different versions of Tika.  So, 0.5 TB.  Let's double that for 
some growth: 1 TB.

So, with a modest 4 cores and 1 TB of storage, we could run Tika against 100k 
files in a bit more than an hour.  Add another few minutes to compare output 
for comparison statistics.

***These numbers are based on a purely in-memory run.  We'll probably want to 
run against a server (not the public one, of course) so that'll add some to the 
time.

Do these numbers jibe with what others are experiencing?

The big gotcha, of course, is that we'll want to harden the server and/or 
create a server daemon to restart the server(s) for OOM and infinite hangs.  
But I think those features are badly needed and this project will give good 
motivation for these improvements.



> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-05-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007294#comment-14007294
 ] 

Nick Burch commented on TIKA-1308:
--

Note that some parsers will always require a File to work, due to the API 
contract of the upstream library used in parsing.

> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: yuanyun.cn
>  Labels: gae
> Fix For: 1.6
>
>
> I am trying to use Tika in GAE and write a simple servlet to extract meta 
> data info from jpeg:
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> This fails with exception:
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
> Checked the code, in 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext), it creates a temp file from the input stream.
> I can understand why tika create temp file from the stream: so tika can parse 
> it multiple times.
> But as GAE and other cloud servers are getting more popular, is it possible 
> to avoid create temp file: instead we can copy the origin stream to a 
> byteArray stream, so tika can also parse it multiple times.
> -- This will have a limit on the file size, as tika keeps the whole file in 
> memory, but this can make tika work in GAE and maybe other cloud server.
> We can add a parameter in parser.parse to indicate whether do in memory parse 
> only.
>  





Re: Hello

2014-05-23 Thread Mattmann, Chris A (3980)
Welcome, buddy, and thanks for sending this!

Welcome to the list, Tyler!

Cheers,
Chris




-Original Message-
From: Tyler Palsulich 
Reply-To: "dev@tika.apache.org" 
Date: Friday, May 23, 2014 9:06 AM
To: "dev@tika.apache.org" 
Subject: Hello

>Hi All,
>
>My name is Tyler Palsulich. I am a computer science master's student at
>NYU
>who will be working with Chris Mattmann this summer. I'll be working
>on the DARPA XDATA project at
>JPL, using/adapting Tika to handle various Twitter and/or Bitcoin data.
>I'm
>excited to start contributing to Tika!
>
>Any development tips I should know?
>
>Thanks,
>Tyler



Hello

2014-05-23 Thread Tyler Palsulich
Hi All,

My name is Tyler Palsulich. I am a computer science master's student at NYU
who will be working with Chris Mattmann this summer. I'll be working
on the DARPA XDATA project at
JPL, using/adapting Tika to handle various Twitter and/or Bitcoin data. I'm
excited to start contributing to Tika!

Any development tips I should know?

Thanks,
Tyler


[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-05-23 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007292#comment-14007292
 ] 

Chris A. Mattmann commented on TIKA-1308:
-

Sounds like a great issue - looking forward to seeing a patch. Thanks!

> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: yuanyun.cn
>  Labels: gae
> Fix For: 1.6
>
>
> I am trying to use Tika in GAE and write a simple servlet to extract meta 
> data info from jpeg:
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> This fails with exception:
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
> Checked the code, in 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext), it creates a temp file from the input stream.
> I can understand why tika create temp file from the stream: so tika can parse 
> it multiple times.
> But as GAE and other cloud servers are getting more popular, is it possible 
> to avoid create temp file: instead we can copy the origin stream to a 
> byteArray stream, so tika can also parse it multiple times.
> -- This will have a limit on the file size, as tika keeps the whole file in 
> memory, but this can make tika work in GAE and maybe other cloud server.
> We can add a parameter in parser.parse to indicate whether do in memory parse 
> only.
>  





[jira] [Created] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-05-23 Thread yuanyun.cn (JIRA)
yuanyun.cn created TIKA-1308:


 Summary: Support in memory parse mode(don't create temp file): to 
support run Tika in GAE
 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: yuanyun.cn
 Fix For: 1.6


I am trying to use Tika in GAE and wrote a simple servlet to extract metadata 
from a JPEG:

String urlStr = req.getParameter("imageUrl");
byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));

ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
Metadata metadata = new Metadata();
BodyContentHandler ch = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(bais, ch, metadata, new ParseContext());
bais.close();

This fails with an exception:
Caused by: java.lang.SecurityException: Unable to create temporary file
at java.io.File.createTempFile(File.java:1986)
at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

Checking the code: in org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, 
ContentHandler, Metadata, ParseContext), Tika creates a temp file from the 
input stream.

I can understand why Tika creates a temp file from the stream: so Tika can 
parse it multiple times.

But as GAE and other cloud platforms are getting more popular, is it possible 
to avoid creating the temp file? Instead, we could copy the original stream 
into a byte-array stream, so Tika can still parse it multiple times.
-- This would put a limit on the file size, since Tika would keep the whole 
file in memory, but it would make Tika work in GAE and maybe other cloud 
servers.

We could add a parameter in parser.parse to indicate whether to do the 
in-memory parse only.
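The in-memory alternative described above can be sketched in plain Java. This 
uses no Tika APIs; it only shows the buffering idea, not an actual patch to 
JpegParser:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the proposed in-memory mode: spool the source stream into a
// byte array once, then hand out a fresh stream for each parse pass,
// avoiding File.createTempFile (which GAE's sandbox forbids).
public class InMemoryBuffer {
    static byte[] buffer(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = buffer(new ByteArrayInputStream("jpeg bytes".getBytes()));
        // Each pass over the content gets its own stream; no temp file needed.
        InputStream firstPass = new ByteArrayInputStream(data);
        InputStream secondPass = new ByteArrayInputStream(data);
        System.out.println(firstPass.available() == secondPass.available());
    }
}
```

As the issue notes, this trades memory for sandbox compatibility: the whole 
file must fit on the heap.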
 





[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007199#comment-14007199
 ] 

Chris A. Mattmann commented on TIKA-1302:
-

[~talli...@apache.org] this is a good question -- the VM that Lewis set up is, 
I believe, so that anyone can try out Tika via the JAX-RS service. I would 
imagine that if we do the large batch of docs nightly test (which I think would 
be awesome, btw), we'll need to figure out the specs we would need and then 
compare them to the VM that Lewis just had set up. How much RAM, CPU, disk, 
etc. do you think we'll need, Tim?

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007132#comment-14007132
 ] 

Tim Allison commented on TIKA-1302:
---

[~chrismattmann], [~gagravarr], [~lewismc] and All,
  Would it be OK to start trying to work on this on the VM that Lewis just had 
set up for TIKA-1301?  I figure we can take baby steps on that, and if this 
kind of process turns out to be useful to the community and we need more 
resources, then we can set up a separate VM.

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 


