[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-06 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481555#comment-14481555
 ] 

Konstantin Gribov commented on TIKA-1330:
-

[~talli...@mitre.org], you have mixed CRLF/LF line endings in tika-batch, in pom.xml at 
least. Is it OK if I fix them, or are you working on this module now?

 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481614#comment-14481614
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #607 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/607/])
tika-batch cosmetics

TIKA-1330 (grossws: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1671627)
* /tika/trunk/tika-batch/pom.xml




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481902#comment-14481902
 ] 

Tim Allison commented on TIKA-1330:
---

Thank you, [~grossws]!



[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-06 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481917#comment-14481917
 ] 

Konstantin Gribov commented on TIKA-1330:
-

That was just a test to check that INFRA-9355 is fixed.

If you aren't working on tika-batch now, I can do a simple style check (for 
indentation and CRLF/LF consistency). But if you are working on it now, I'll just put 
it on my todo list to avoid heavy merges in work-in-progress code.
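
For reference, the CRLF/LF part of such a check is easy to script. The sketch below is only an illustration, not part of Tika or of any existing style-check tool; the class name LineEndingCheck and its Style enum are made up for this example. It counts CRLF and bare LF terminators and reports whether a file mixes them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Tika): classifies the line endings in a file. */
final class LineEndingCheck {

    enum Style { NONE, LF, CRLF, MIXED }

    /** Counts CRLF pairs and bare LF bytes, then reports which kinds were seen. */
    static Style classify(byte[] content) {
        int crlf = 0;
        int lf = 0;
        for (int i = 0; i < content.length; i++) {
            if (content[i] == '\n') {
                if (i > 0 && content[i - 1] == '\r') {
                    crlf++;
                } else {
                    lf++;
                }
            }
        }
        if (crlf > 0 && lf > 0) {
            return Style.MIXED;
        }
        if (crlf > 0) {
            return Style.CRLF;
        }
        return lf > 0 ? Style.LF : Style.NONE;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a file to inspect, e.g. tika-batch/pom.xml.
        for (String arg : args) {
            Path path = Paths.get(arg);
            System.out.println(path + ": " + classify(Files.readAllBytes(path)));
        }
    }
}
```

Compiled with javac, it can be run as `java LineEndingCheck tika-batch/pom.xml`; any file reported as MIXED is a candidate for normalization.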



[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482242#comment-14482242
 ] 

Tim Allison commented on TIKA-1330:
---

Thank you for the offer!  I'm going to turn to some other tasks for now so 
please fix whatever you'd like.  I'm happy to make the fixes, too.

I'm sorry that CRs snuck into the code.  I need to check my IntelliJ settings 
again.



Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Tyler Palsulich
All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch ant tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java





[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392027#comment-14392027
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/599/])
TIKA-1330 fix logging in TikaCLI to avoid adding multiple appenders (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670804)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j.properties




RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Allison, Timothy B.
The duplication/triplication of 
 INFO - about to start driver
 INFO - about to start driver

happened because main() adds a new appender with .configure(), so subsequent calls 
to main() in the tests were adding more appenders.  I just fixed that in 
r1670804.
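
To make the mechanism concrete, here is a minimal sketch of the same pitfall. It uses java.util.logging handlers instead of log4j appenders so that it runs on the JDK alone, so it is an analogy and not the actual change in r1670804; the class DuplicateHandlerDemo and its method names are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical demo: stands in for a CLI whose main() configures logging. */
final class DuplicateHandlerDemo {

    /** Handler that only counts the records it receives. */
    static final class CountingHandler extends Handler {
        private final AtomicInteger counter;

        CountingHandler(AtomicInteger counter) {
            this.counter = counter;
        }

        @Override public void publish(LogRecord record) { counter.incrementAndGet(); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /**
     * Simulates mainCalls invocations of a main() that configures logging,
     * then logs one message and returns how many times it was delivered.
     */
    static int deliveriesAfter(int mainCalls, boolean guarded) {
        Logger log = Logger.getAnonymousLogger(); // fresh logger, no shared state
        log.setUseParentHandlers(false);          // keep the console quiet
        log.setLevel(Level.INFO);
        AtomicInteger delivered = new AtomicInteger();
        for (int i = 0; i < mainCalls; i++) {
            // Unguarded: every call attaches one more handler.
            // Guarded: attach a handler only if none is present yet.
            if (!guarded || log.getHandlers().length == 0) {
                log.addHandler(new CountingHandler(delivered));
            }
        }
        log.info("about to start driver");
        return delivered.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded, 3 calls: " + deliveriesAfter(3, false));
        System.out.println("guarded, 3 calls:   " + deliveriesAfter(3, true));
    }
}
```

With three unguarded calls the single message is delivered three times; with the guard it is delivered once, which matches the repeated "about to start driver" lines seen when the tests called main() more than once.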

What I can't figure out is why you're seeing anything.  I've redirected both 
stdout and stderr in setup() (annotated @Before) to ByteArrayOutputStreams.

If setup() weren't being called, you'd get NPEs for each of the four tests, so 
setup() must be getting called.
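
For readers who haven't seen the test, the redirection technique looks roughly like the sketch below. This is not the code from TikaCLIBatchIntegrationTest, which isn't shown in this thread; OutputCapture and capture() are hypothetical names. One general Java caveat it illustrates: anything that took a reference to System.out before the redirect keeps writing to the original stream. I can't tell from the thread whether that is what happened here.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Hypothetical helper: runs an action with stdout/stderr redirected to buffers. */
final class OutputCapture {

    /** Returns {captured stdout, captured stderr} and restores the original streams. */
    static String[] capture(Runnable action) {
        PrintStream originalOut = System.out;
        PrintStream originalErr = System.err;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayOutputStream err = new ByteArrayOutputStream();
        System.setOut(new PrintStream(out, true));
        System.setErr(new PrintStream(err, true));
        try {
            action.run();
        } finally {
            // Flush the temporary streams, then put the real ones back.
            System.out.flush();
            System.err.flush();
            System.setOut(originalOut);
            System.setErr(originalErr);
        }
        return new String[] { out.toString(), err.toString() };
    }

    public static void main(String[] args) {
        PrintStream early = System.out; // reference taken before the redirect
        String[] captured = capture(() -> {
            System.out.println("captured line");
            early.println("written through the earlier reference"); // not captured
        });
        System.out.print("buffer holds: " + captured[0]);
    }
}
```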

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Wednesday, April 01, 2015 7:39 PM
To: dev@tika.apache.org
Subject: Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch ant tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java





[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391255#comment-14391255
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #595 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/595/])
TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670749)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j_batch_process.properties
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* /tika/trunk/tika-batch/src/test/resources/log4j.properties
* /tika/trunk/tika-batch/src/test/resources/log4j_process.properties




RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Allison, Timothy B.
This looks like a Hudson hiccup.

Tyler is seeing excessive logging:
Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver

Anyone else having problems building from a fresh trunk?


-Original Message-
From: Hudson (JIRA) [mailto:j...@apache.org] 
Sent: Wednesday, April 01, 2015 5:36 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ] 

Hudson commented on TIKA-1330:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of 
tika-batch, take 2 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ] 

Hudson commented on TIKA-1330:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of 
tika-batch, take 2 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387394#comment-14387394
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #588 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/588/])
TIKA-1330, trivial fixes to avoid NPE with consumersManagerMaxMillis parameter 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670185)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java
* /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387864#comment-14387864
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #589 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/589/])
TIKA-1330: add integration tests to TikaCLITest (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670237)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java
* /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-MockConsumersBuilder.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-broken.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-test.xml




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380479#comment-14380479
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #571 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/571/])
TIKA-1330 clean up logging and some dependencies. Still some log4j dependencies 
for now (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669187)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-batch/src/main/examples
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/strawman/StrawManTikaAppDriver.java




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-03-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376175#comment-14376175
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #566 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/566/])
initial commit of TIKA-1330 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1668673)
* /tika/trunk/CHANGES.txt
* /tika/trunk/pom.xml
* /tika/trunk/tika-app/pom.xml
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j.properties
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchCommandLineTest.java
* /tika/trunk/tika-batch
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-batch/src
* /tika/trunk/tika-batch/src/main
* /tika/trunk/tika-batch/src/main/examples
* /tika/trunk/tika-batch/src/main/examples/batchExecutor.sh
* /tika/trunk/tika-batch/src/main/examples/log4j.xml
* /tika/trunk/tika-batch/src/main/examples/log4j_driver.xml
* /tika/trunk/tika-batch/src/main/java
* /tika/trunk/tika-batch/src/main/java/org
* /tika/trunk/tika-batch/src/main/java/org/apache
* /tika/trunk/tika-batch/src/main/java/org/apache/tika
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchNoRestartError.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ConsumersManager.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileConsumerFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResource.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawlerFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileStarted.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/InterrupterFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/OutputStreamFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/PoisonFileResource.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/StatusReporterFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/AbstractConsumersBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/BatchProcessBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/CommandLineParserBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/DefaultContentHandlerFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/IContentHandlerFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ICrawlerBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMAndQueueBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ReporterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSConsumersManager.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDirectoryCrawler.java
* 

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-03-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348047#comment-14348047
 ] 

Tim Allison commented on TIKA-1330:
---

Posted patch to review board [31758|https://reviews.apache.org/r/31758/]

Some more work is needed, but this is ready for thumbs up/thumbs down and any 
and all review.

Depending on feedback, I'd like to merge this into trunk over the next week or 
two.

Simplest way to run it is from tika-app:

java -jar tika-app---.jar input-dir output-dir



[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-11-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205290#comment-14205290
 ] 

Tim Allison commented on TIKA-1330:
---

Added preliminary integration into tika-app on github 
[fork|https://github.com/tballison/tika/tree/TIKA-1302] (branch TIKA-1302).

minimal call:
java -jar tika-app.jar input-directory



[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147922#comment-14147922
 ] 

Tim Allison commented on TIKA-1330:
---

[~tilman], I leave it as an exercise to implement a FileResourceConsumer that 
uses pure PDFBox. ;) 

Seriously, though, I plan to add something like that in the tika examples 
module (at some point down the road), and all feedback is welcome.



[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121454#comment-14121454
 ] 

Tim Allison commented on TIKA-1330:
---

Started documentation on the [wiki|https://wiki.apache.org/tika/TikaBatch].  
Any and all feedback is welcomed.

Will post patch to rb (if possible) or to this issue some time next week.




[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119745#comment-14119745
 ] 

Tim Allison commented on TIKA-1330:
---

It looks like the ballpark estimate of processing time on TIKA-1302 was about 
right.  I just finished a complete run of govdocs1 (~1 million files) on an 8-CPU 
VM with 8 GB available, -Xmx4g.  The run used 15 consumers and completed in 
about 4 hours.  The driver restarted the process thirteen times (6 permanent 
hangs and 7 OOMs).
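
As a rough illustration of what "robust against OOM and hangs" means at the level of a single task, the sketch below runs a task under a timeout and classifies the outcome. It is not the tika-batch driver, which restarts a separate process; WatchdogSketch, Outcome and runGuarded() are hypothetical names, and the OOM case is simulated by throwing an OutOfMemoryError by hand.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical in-process watchdog; a sketch, not the tika-batch implementation. */
final class WatchdogSketch {

    enum Outcome { OK, HUNG, OOM, FAILED }

    /** Runs the task on a daemon thread and reports how it ended. */
    static Outcome runGuarded(Callable<?> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable);
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.OK;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; it may ignore this
            return Outcome.HUNG;
        } catch (ExecutionException e) {
            // Throwables raised by the task arrive wrapped as the cause.
            return e.getCause() instanceof OutOfMemoryError ? Outcome.OOM : Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runGuarded(() -> "parsed", 1000));
        System.out.println(runGuarded(() -> { Thread.sleep(5000); return null; }, 100));
        System.out.println(runGuarded(() -> { throw new OutOfMemoryError("simulated"); }, 1000));
    }
}
```

An in-process guard like this cannot stop a thread that ignores interruption, and a real OutOfMemoryError can leave the JVM in an unreliable state, so it is not a substitute for a separate driver that restarts the whole process, as in the run described above.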
