Jenkins build is back to normal : Tika-trunk #987

2013-02-22 Thread Apache Jenkins Server
See 



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584754#comment-13584754
 ] 

Luis Filipe Nassif commented on TIKA-1074:
--

I do not like this improvement. Currently I set a custom AutoDetectParser in 
the ParseContext that, when an exception is hit (e.g. while visiting an 
embedded document), catches the exception, logs it AND extracts raw/binary 
strings from the problematic document (or embedded document). My app needs to 
extract text even from corrupt documents. With this "improvement" I will not 
learn which embedded document is problematic at the point when it is best to 
do something about it. I prefer to receive the exception when it is thrown and 
work around it myself.
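
A minimal sketch of the kind of fallback described above, written without any Tika dependency. The `EmbeddedParser` interface and the `RawStringsFallback` class are hypothetical stand-ins, not Tika classes: on failure the wrapper records the exception and salvages runs of printable ASCII from the raw bytes, much like the Unix `strings` tool.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for whatever parses one embedded document. */
interface EmbeddedParser {
    String parse(byte[] data) throws Exception;
}

/** Hypothetical wrapper: try the real parser, fall back to raw strings on failure. */
final class RawStringsFallback {
    private final EmbeddedParser delegate;
    private final List<Exception> failures = new ArrayList<>();

    RawStringsFallback(EmbeddedParser delegate) {
        this.delegate = delegate;
    }

    String extract(byte[] data) {
        try {
            return delegate.parse(data);
        } catch (Exception e) {
            failures.add(e);             // record it so the caller can log or inspect it
            return rawStrings(data, 4);  // salvage whatever text is recoverable
        }
    }

    List<Exception> failures() {
        return failures;
    }

    /** Collects runs of at least minLen printable ASCII bytes, one run per line. */
    static String rawStrings(byte[] data, int minLen) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b < 0x7f) {
                run.append((char) b);
            } else {
                flush(run, out, minLen);
            }
        }
        flush(run, out, minLen);
        return out.toString();
    }

    private static void flush(StringBuilder run, StringBuilder out, int minLen) {
        if (run.length() >= minLen) {
            out.append(run).append('\n');
        }
        run.setLength(0);
    }
}
```

The point of the design is that the exception is handled at the moment it is thrown, while the bytes of the failing document are still at hand.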

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Build failed in Jenkins: Tika-trunk #986

2013-02-22 Thread Michael McCandless
Hmmm:

ERROR: Maven JVM terminated unexpectedly with exit code 143

I think that means the JVM was killed with SIGTERM.
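
The arithmetic behind that reading: by the usual POSIX shell convention, a process terminated by signal N is reported with exit status 128 + N, and SIGTERM is signal 15, so 143 = 128 + 15. A small sketch (the `ExitCodes` class is a hypothetical helper, not part of Jenkins or Maven):

```java
final class ExitCodes {
    /** Returns the signal number implied by a shell-style exit status, or -1 if none. */
    static int signalFromExitStatus(int status) {
        return (status > 128 && status < 256) ? status - 128 : -1;
    }
}
```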

I'll kick off a new build

Mike McCandless

http://blog.mikemccandless.com

On Fri, Feb 22, 2013 at 3:30 PM, Apache Jenkins Server
 wrote:
> See 
>
> Changes:
>
> [mikemccand] TIKA-1074: remove future proofing for InterruptedException
>
> --
> [...truncated 415 lines...]
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
> Running org.apache.tika.parser.asm.ClassParserTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
> Running org.apache.tika.parser.pdf.PDFParserTest
> Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.1 sec
> Running org.apache.tika.parser.hdf.HDFParserTest
>  WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
> *refno=53 tag= VG (1965) Vgroup length=34 class= Dim0.0 name= Longitude using 
> data 52
>  WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
> *refno=55 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= Latitude using 
> data 54
>  WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
> *refno=57 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim2 using 
> data 56
>  WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
> *refno=59 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim3 using 
> data 58
>  WARN [main] (H4header.java:832) - data tag missing vgroup= 70 Sea Surface 
> Temperature
>  WARN [main] (H4header.java:832) - data tag missing vgroup= 73 Number of 
> Observations per Bin
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.13 sec
> Running org.apache.tika.parser.font.AdobeFontMetricParserTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
> Running org.apache.tika.embedder.ExternalEmbedderTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
> Running org.apache.tika.detect.TestContainerAwareDetector
> Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.005 sec
> Running org.apache.tika.TestParsers
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.242 sec
> Running org.apache.tika.mime.MimeTypeTest
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
> Running org.apache.tika.mime.TestMimeTypes
> Tests run: 39, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.112 sec
> Running org.apache.tika.mime.MimeTypesTest
> Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
>
> Results :
>
> Tests run: 441, Failures: 0, Errors: 0, Skipped: 0
>
> [JENKINS] Recording test results
> [INFO]
> [INFO] --- maven-bundle-plugin:2.3.4:bundle (default-bundle) @ tika-parsers 
> ---
> [INFO]
> [INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ 
> tika-parsers ---
> [INFO] [INFO] Exclude: src/main/java/org/apache/tika/parser/txt/Charset*.java
> [INFO] Exclude: src/test/resources/test-documents/**
>
> [INFO] --- apache-rat-plugin:0.7:check (default) @ tika-parsers ---
> [PMD] No report found for mojo check
> [INFO] [INFO] Installing 
> 
>  to 
> /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-SNAPSHOT.jar
>
> [INFO] --- maven-install-plugin:2.3.1:install (default-install) @ 
> tika-parsers ---
> [INFO] Installing 
>  to 
> /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-SNAPSHOT.pom
> [INFO] Local OBR update disabled (enable with -DobrRepository)
> [INFO]
> [INFO] --- maven-bundle-plugin:2.3.4:install (default-install) @ tika-parsers 
> ---
> Downloading: 
> https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/maven-metadata.xml
> [INFO]
> [INFO] --- maven-deploy-plugin:2.6:deploy (default-deploy) @ tika-parsers ---
> Downloaded: 
> https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/maven-metadata.xml
>  (774 B at 2.8 KB/sec)
> Uploading: 
> https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.jar
> Uploaded: 
> https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.jar
>  (498 KB at 732.8 KB/sec)
> Uploading: 
> https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.pom
> Uploaded: 
> https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-

Build failed in Jenkins: Tika-trunk #986

2013-02-22 Thread Apache Jenkins Server
See 

Changes:

[mikemccand] TIKA-1074: remove future proofing for InterruptedException

--
[...truncated 415 lines...]
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
Running org.apache.tika.parser.asm.ClassParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.apache.tika.parser.pdf.PDFParserTest
Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.1 sec
Running org.apache.tika.parser.hdf.HDFParserTest
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=53 tag= VG (1965) Vgroup length=34 class= Dim0.0 name= Longitude using 
data 52
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=55 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= Latitude using 
data 54
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=57 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim2 using 
data 56
 WARN [main] (H4header.java:392) - **dimension length=0 for TagVGroup= 
*refno=59 tag= VG (1965) Vgroup length=33 class= Dim0.0 name= fakeDim3 using 
data 58
 WARN [main] (H4header.java:832) - data tag missing vgroup= 70 Sea Surface 
Temperature
 WARN [main] (H4header.java:832) - data tag missing vgroup= 73 Number of 
Observations per Bin
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.13 sec
Running org.apache.tika.parser.font.AdobeFontMetricParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.apache.tika.embedder.ExternalEmbedderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.apache.tika.detect.TestContainerAwareDetector
Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.005 sec
Running org.apache.tika.TestParsers
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.242 sec
Running org.apache.tika.mime.MimeTypeTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.apache.tika.mime.TestMimeTypes
Tests run: 39, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.112 sec
Running org.apache.tika.mime.MimeTypesTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec

Results :

Tests run: 441, Failures: 0, Errors: 0, Skipped: 0

[JENKINS] Recording test results
[INFO] 
[INFO] --- maven-bundle-plugin:2.3.4:bundle (default-bundle) @ tika-parsers ---
[INFO] 
[INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ 
tika-parsers ---
[INFO] [INFO] Exclude: src/main/java/org/apache/tika/parser/txt/Charset*.java
[INFO] Exclude: src/test/resources/test-documents/**

[INFO] --- apache-rat-plugin:0.7:check (default) @ tika-parsers ---
[PMD] No report found for mojo check
[INFO] [INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-SNAPSHOT.jar

[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika-parsers 
---
[INFO] Installing 
 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-SNAPSHOT.pom
[INFO] Local OBR update disabled (enable with -DobrRepository)
[INFO] 
[INFO] --- maven-bundle-plugin:2.3.4:install (default-install) @ tika-parsers 
---
Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/maven-metadata.xml
[INFO] 
[INFO] --- maven-deploy-plugin:2.6:deploy (default-deploy) @ tika-parsers ---
Downloaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/maven-metadata.xml
 (774 B at 2.8 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.jar
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.jar
 (498 KB at 732.8 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.pom
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/tika-parsers-1.4-20130222.202947-17.pom
 (9 KB at 22.9 KB/sec)
Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/maven-metadata.xml
Downloaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/maven-metadata.xml
 (343 B at 1.8 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/tika/tika-parsers/1.4-SNAPSHOT/maven-metada

[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584362#comment-13584362
 ] 

Michael McCandless commented on TIKA-1074:
--

OK I'll remove the future proofing.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1086) Tika-bundle 1.3 does not import org.w3c.dom package

2013-02-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584310#comment-13584310
 ] 

Nick Burch commented on TIKA-1086:
--

Could you suggest a fix for this? Does it need to happen in the Maven task that 
bundles it up, or elsewhere? 

Is there an easy way to see the problem? Any chance of a failing unit test? :)

> Tika-bundle 1.3 does not import org.w3c.dom package
> ---
>
> Key: TIKA-1086
> URL: https://issues.apache.org/jira/browse/TIKA-1086
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Gaurav
> Fix For: 1.2
>
>
> tika-bundle 1.3 does not import the org.w3c.dom package; as a result, it 
> cannot parse DOM-based documents such as Microsoft Word (docx) documents.
> This issue does not occur in version 1.2, which imports the necessary 
> package, so parsing those documents works fine.
> Can someone please look into the issue, as Microsoft Word is a very popular 
> document format.



[jira] [Commented] (TIKA-1088) Unsupported AutoCAD drawing version: AC1009

2013-02-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584309#comment-13584309
 ] 

Nick Burch commented on TIKA-1088:
--

Do you know what metadata is stored in the file?

Ideally, would you be able to create a DWG file in this older format with 
exactly the same metadata as we have in our current test DWG files? (See 
DWGParserTest for what those are)

> Unsupported AutoCAD drawing version: AC1009
> ---
>
> Key: TIKA-1088
> URL: https://issues.apache.org/jira/browse/TIKA-1088
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Hardik Upadhyay
> Attachments: 227051.dwg
>
>
> Tika parser versions 1.2 and 1.3 fail to parse DWG file version AC1009.



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584294#comment-13584294
 ] 

Jukka Zitting commented on TIKA-1074:
-

bq. Wait, do you mean I should remove the handling entirely (not bother future 
proofing)?

If POI decides to declare IE (or just generic Exception) as thrown by their 
API, it'll break binary compatibility, and thus in any case we'll need to 
adjust our code. So adding future proofing code here doesn't win us anything, 
it just complicates the codebase for no gain.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584249#comment-13584249
 ] 

Michael McCandless commented on TIKA-1074:
--

{quote}
bq. InterruptedException is never thrown in these places today, so I can't add 
the separate catch clause (compiler is angry).

It's a checked exception, so if it isn't declared to be thrown by POI, it 
shouldn't get thrown here (even though the VM doesn't strictly prohibit that).
{quote}

Exactly: I'm trying to future proof.

bq. So in that case the extra check shouldn't even be needed.

Wait, do you mean I should remove the handling entirely (not bother future 
proofing)?

{quote}
bq. I think it's cleaner to set the interrupt bit and let the next place that 
waits see the interrupt bit and throw IE?

I don't really like this approach. We're essentially saying: "Yes, you asked me 
to stop what I'm doing, but instead I'll just finish up what I was doing and 
ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to 
stop immediately, either by letting the IE bubble up or (where necessary) by 
throwing a TikaException that wraps the IE.
{quote}

OK, maybe we can throw TikaException today (*and* set the interrupt
bit), and then in the future (if/when these places really do throw
IE), we can change this to throwing an IE instead of TikaException.  I
can put that as a TODO.
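
The compromise described above (throw a wrapping exception today and also restore the interrupt bit) might look roughly like the following sketch. `ExtractionException` is a hypothetical stand-in for TikaException so that the example compiles without Tika, and `runStep` is an illustrative name, not an existing method.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for TikaException, so the sketch compiles without Tika. */
final class ExtractionException extends Exception {
    ExtractionException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class InterruptHandling {
    /** Runs a step; on interrupt, restores the flag and stops with a wrapping exception. */
    static void runStep(Callable<Void> step) throws ExtractionException {
        try {
            step.call();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // restore the interrupt bit for callers
            // TODO: rethrow the InterruptedException itself once callers can declare it
            throw new ExtractionException("Interrupted during extraction", e);
        } catch (Exception e) {
            throw new ExtractionException("Extraction failed", e);
        }
    }
}
```

A caller that only knows about the wrapping exception still stops immediately, while code further up can check `Thread.interrupted()` or inspect the cause.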


> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584229#comment-13584229
 ] 

Jukka Zitting commented on TIKA-1074:
-

bq. InterruptedException is never thrown in these places today, so I can't add 
the separate catch clause (compiler is angry).

It's a checked exception, so if it isn't declared to be thrown by POI, it 
shouldn't get thrown here (even though the VM doesn't strictly prohibit that). 
So in that case the extra check shouldn't even be needed.
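
To illustrate the parenthetical about the VM: the checked-exception rule is enforced by the compiler, not at run time, so an undeclared InterruptedException can still surface. The sketch below uses the well-known generic "sneaky throw" idiom; the class and method names are illustrative only and have nothing to do with POI or Tika.

```java
final class SneakyThrowDemo {
    /** Generic trick: with T chosen as RuntimeException, nothing checked is declared. */
    @SuppressWarnings("unchecked")
    static <T extends Throwable> void sneakyThrow(Throwable t) throws T {
        throw (T) t;  // the cast is erased, so no ClassCastException occurs
    }

    /** Declares no checked exceptions, yet throws InterruptedException at run time. */
    static void undeclared() {
        SneakyThrowDemo.<RuntimeException>sneakyThrow(new InterruptedException("sneaky"));
    }

    /** Returns true if an undeclared InterruptedException reached the caller. */
    static boolean sawInterruptedException() {
        try {
            undeclared();
        } catch (Exception e) {
            // "catch (InterruptedException e)" would not compile here, because the
            // try block does not declare it; an instanceof test is the only check.
            return e instanceof InterruptedException;
        }
        return false;
    }
}
```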

bq. I think it's cleaner to set the interrupt bit and let the next place that 
waits see the interrupt bit and throw IE?

I don't really like this approach. We're essentially saying: "Yes, you asked me 
to stop what I'm doing, but instead I'll just finish up what I was doing and 
ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to 
stop immediately, either by letting the IE bubble up or (where necessary) by 
throwing a TikaException that wraps the IE.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Updated] (TIKA-1088) Unsupported AutoCAD drawing version: AC1009

2013-02-22 Thread Hardik Upadhyay (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hardik Upadhyay updated TIKA-1088:
--

Attachment: 227051.dwg

> Unsupported AutoCAD drawing version: AC1009
> ---
>
> Key: TIKA-1088
> URL: https://issues.apache.org/jira/browse/TIKA-1088
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Hardik Upadhyay
> Attachments: 227051.dwg
>
>
> Tika parser versions 1.2 and 1.3 fail to parse DWG file version AC1009.



[jira] [Created] (TIKA-1088) Unsupported AutoCAD drawing version: AC1009

2013-02-22 Thread Hardik Upadhyay (JIRA)
Hardik Upadhyay created TIKA-1088:
-

 Summary: Unsupported AutoCAD drawing version: AC1009
 Key: TIKA-1088
 URL: https://issues.apache.org/jira/browse/TIKA-1088
 Project: Tika
 Issue Type: Improvement
 Components: parser
 Reporter: Hardik Upadhyay


Tika parser versions 1.2 and 1.3 fail to parse DWG file version AC1009.



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584194#comment-13584194
 ] 

Ray Gauss II commented on TIKA-1074:


bq. But it's a little weird to throw TikaExc in response to an interrupt (i.e., 
code above will be trying to catch an IE) ... I think it's cleaner to set the 
interrupt bit and let the next place that waits see the interrupt bit and throw 
IE?

That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584176#comment-13584176
 ] 

Michael McCandless commented on TIKA-1074:
--

Thanks Jukka.

InterruptedException is never thrown in these places today, so I can't add the 
separate catch clause (compiler is angry).

So, the instanceof check for IE is in case in the future we do handle 
interrupts in these places ... we could just remove it and add it back in the 
future if we add IE (seems risky).

Or I can change that code to throw TikaException on interrupt instead (and 
restore the interrupt bit), except that in the TikaCLI case 
EmbeddedDocumentExtractor.parseEmbedded doesn't throw TikaException today (the 
other two places already do).  But it's a little weird to throw TikaExc in 
response to an interrupt (i.e., code above will be trying to catch an IE) ... I 
think it's cleaner to set the interrupt bit and let the next place that waits 
see the interrupt bit and throw IE?
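
A rough sketch of the "set the interrupt bit and keep going" option, combined with recording exceptions so that extraction of the remaining embedded documents continues. `EmbeddedVisitor` and its `Visit` interface are hypothetical names for illustration, not Tika code. In this sketch `Visit.run` declares `Exception`, so a separate catch clause would compile; the `instanceof` test simply mirrors the situation in the thread, where the called code declares no InterruptedException.

```java
import java.util.ArrayList;
import java.util.List;

final class EmbeddedVisitor {
    /** Hypothetical stand-in for visiting one embedded document. */
    interface Visit {
        void run() throws Exception;
    }

    /** Visits every entry, recording failures instead of aborting; returns the failures. */
    static List<Exception> visitAll(List<Visit> visits) {
        List<Exception> failures = new ArrayList<>();
        for (Visit v : visits) {
            try {
                v.run();
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Re-set the flag so the next blocking call sees the interrupt.
                    Thread.currentThread().interrupt();
                }
                failures.add(e);  // keep going; the caller decides what to do later
            }
        }
        return failures;
    }
}
```

The trade-off raised in the thread is visible here: the loop finishes its remaining work after an interrupt and only hands the request on to whoever checks the flag next.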

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...
