[jira] [Commented] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'

2014-10-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188192#comment-14188192
 ] 

Nick Burch commented on TIKA-1460:
--

Ideally we really do need the problematic file. Any chance you could have 
another try at attaching it to this JIRA?

Also, you've said you're using Tika 1.3; can you try 1.6?

> Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'
> --
>
> Key: TIKA-1460
> URL: https://issues.apache.org/jira/browse/TIKA-1460
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: win7,myeclipse8.5
>Reporter: onyas
>Priority: Critical
>
> For some reason I could not upload the file, so here is the info.
> I checked every version in the directory 
> \org\apache\pdfbox\resources\cmap and did not find the 'Adobe-GBK1-UCS2' file.
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d640af
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> Caused by: java.lang.IllegalArgumentException: Position 66048 past the end of the file
>   at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:50)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:202)
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 21 more
> The main code is:
>   Parser parser = new AutoDetectParser();
>   ContentHandler handler = new BodyContentHandler(getNum());
>   Metadata metadata = new Metadata();
>   ParseContext context = new ParseContext();
>   InputStream stream = null;
>   StringBuffer content = new StringBuffer();
>   try {
>       stream = new FileInputStream(file);
>       if (stream != null) {
>           parser.parse(stream, handler, metadata, context);
>           content = content.append(handler);
>
>           if (StringUtils.isNotBlank(content.toString())) {
>               hasContent = true;
>               handler = null;
>               metadata = null;
>               context = null;
>           }
>       }
> The exception is thrown at this line: parser.parse(stream, handler, metadata, context);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Cservenak, Tamas (JIRA)
Cservenak, Tamas created TIKA-1461:
--

 Summary: Bad mime detection of certain JAR file
 Key: TIKA-1461
 URL: https://issues.apache.org/jira/browse/TIKA-1461
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 1.6
Reporter: Cservenak, Tamas


Given this "ordinary" Java JAR file
https://maven.atlassian.com/content/groups/public/com/atlassian/support/healthcheck/support-healthcheck-plugin/1.0.3/support-healthcheck-plugin-1.0.3.jar

I manually inspected and tested it; it is a valid JAR file.

Still, Tika Core's Detector detects it as type {{application/x-msdownload; 
format=pe}}. The detection is "hinted" with the file name, so the "jar" hint is 
present, yet the file is not detected as the desired {{application/java-archive}}.

IMO, this happens because of the priority of 
{{application/x-msdownload; format=pe}}, which is 55. If it were 50, the 
"mediation" would kick in, see TIKA-1292. 

Changing or overriding the magic priority using {{custom-mimetypes.xml}} is 
also not possible.

I am unsure what the correct solution is here, or how to work around this without 
patching Tika.

The problem affects both 1.5 and 1.6; we target 1.6.
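
To make the priority argument above concrete, here is a minimal Java sketch of the behaviour as described: only the magic matches that share the highest priority are candidates, and the file-name hint can choose only among those. This is a simplified model, not Tika's actual detection code. The class and method names ({{MagicMediationSketch}}, {{resolve}}) and the parent-type map are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Simplified model of priority-based mediation; NOT Tika's real implementation. */
final class MagicMediationSketch {
    static final String PE = "application/x-msdownload;format=pe";
    static final String ZIP = "application/zip";
    static final String JAR = "application/java-archive";

    /** A magic rule that matched the file content (hypothetical type). */
    static final class MagicMatch {
        final String type;
        final int priority;

        MagicMatch(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    /**
     * Keeps only the matches with the highest priority. The name hint is used
     * only if it equals one of those candidates or is a declared child of one.
     */
    static String resolve(List<MagicMatch> matches, String hint, Map<String, String> parents) {
        if (matches.isEmpty()) {
            return hint;
        }
        int top = Integer.MIN_VALUE;
        for (MagicMatch m : matches) {
            top = Math.max(top, m.priority);
        }
        List<String> candidates = new ArrayList<>();
        for (MagicMatch m : matches) {
            if (m.priority == top) {
                candidates.add(m.type);
            }
        }
        for (String candidate : candidates) {
            if (candidate.equals(hint) || candidate.equals(parents.get(hint))) {
                return hint;
            }
        }
        // The hint is unrelated to every top-priority candidate, so it is ignored.
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // Assumption for the sketch: JAR is modelled as a child type of ZIP.
        Map<String, String> parents = Collections.singletonMap(JAR, ZIP);
        // PE magic at 55 outranks ZIP magic at 50, so the "jar" hint is never consulted.
        System.out.println(resolve(
                Arrays.asList(new MagicMatch(PE, 55), new MagicMatch(ZIP, 50)), JAR, parents));
        // With equal priorities both are candidates and the hint can pick the JAR type.
        System.out.println(resolve(
                Arrays.asList(new MagicMatch(PE, 50), new MagicMatch(ZIP, 50)), JAR, parents));
    }
}
```

Under this model the first call yields the PE type and the second the JAR type, which is the difference between priority 55 and priority 50 described above.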





[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Cservenak, Tamas (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188216#comment-14188216
 ] 

Cservenak, Tamas commented on TIKA-1461:


Seems this change in {{custom-mimetypes.xml}} solves the problem:

{noformat}
[XML snippet not preserved in the archive]
{noformat}

This change adds a magic to {{application/java-archive}} (which has none of 
its own; it inherits from {{application/zip}}) with a priority of 55. 
Hence, the later hinting done in TIKA-1292 selects {{application/java-archive}} 
over {{application/x-msdownload;format=pe}}.




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-29 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Attachment: TIKA-1445_tallison_v3_20141027.patch

This version subclasses Parser to create an ImageMetaParser class, which our 
current image metadata parsers then extend.

This adds a DefaultImageMetadataparser that is a copy and paste of 
DefaultParser...can't override static loader unfortunately!

We now specify regular parsers in the Parser services file and 
ImageMetadataParsers in a separate services file.

I don't like that this creates a new "class" of parsers, but I can't think of 
another way of guaranteeing that the OCRParser will find an image metadata 
parser correctly.

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Cservenak, Tamas (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188455#comment-14188455
 ] 

Cservenak, Tamas commented on TIKA-1461:


Scratch the above: it made all ZIPs appear as JARs. The fix was instead to 
re-add ZIP with a higher-priority magic (basically, I copied the entry from 
Tika's XML to custom-mimetypes.xml).



[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188479#comment-14188479
 ] 

Nick Burch commented on TIKA-1461:
--

Do you know the license of that file? And/or do you know of a different JAR, 
Apache / BSD licensed, that shows the problem?

Before we go about making changes for this, we really want to add a unit test 
so we can ensure it is properly fixed and, equally importantly, remains fixed 
into the future!



[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Cservenak, Tamas (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188487#comment-14188487
 ] 

Cservenak, Tamas commented on TIKA-1461:


Don't know of any other JAR that shows the problem.

The POM seems to be licensed as "BSD License", according to the parent POM it 
inherits from:
https://maven.atlassian.com/content/groups/public/com/atlassian/pom/public-pom/3.0.6/public-pom-3.0.6.pom



[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188555#comment-14188555
 ] 

Nick Burch commented on TIKA-1461:
--

I've just tried with a recent snapshot build, and both with and without the 
filename the Tika app is able to correctly detect the type:

$ java -jar tika-app-1.7-SNAPSHOT.jar --detect /tmp/support-healthcheck-plugin-1.0.3.jar
application/java-archive
$ java -jar tika-app-1.7-SNAPSHOT.jar --detect < /tmp/support-healthcheck-plugin-1.0.3.jar
application/java-archive

Any chance you could retest with a recent nightly build / build from svn trunk, 
and see if we've already solved this?



[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Cservenak, Tamas (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188779#comment-14188779
 ] 

Cservenak, Tamas commented on TIKA-1461:


Nick, tika-app is known to be OK with detecting this file (see the Nexus issue 
comments). But we use tika-core only, which, unlike tika-app, has a limited set 
of detectors and AFAIK relies solely on tika-mimetypes.xml (plus we use 
custom-mimetypes.xml too). Tika-Core alone does produce this problem.

Please see comments on original Nexus issue
https://issues.sonatype.org/browse/NEXUS-7603



[jira] [Resolved] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1461.
--
   Resolution: Fixed
Fix Version/s: 1.7

Fixed in r1635263. To be a valid PE file, it needs to start with MZ too. Magic 
tweaked and test files added accordingly, thanks!
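
For illustration, the rule can be sketched in plain Java. This is not the magic definition committed in r1635263. It only encodes two well-known format facts: a PE executable starts with the DOS "MZ" header and stores the offset of its "PE\0\0" signature at 0x3C, while a ZIP or JAR starts with "PK\3\4". The class and method names are made up for this sketch, and the sample byte arrays are synthetic.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative byte-level checks; not Tika's magic matcher. */
final class PeMagicSketch {
    private static final byte[] MZ = {'M', 'Z'};
    private static final byte[] PE_SIGNATURE = {'P', 'E', 0, 0};
    private static final byte[] ZIP_LOCAL_HEADER = {'P', 'K', 3, 4};
    /** Offset of the 4-byte little-endian pointer to the PE signature. */
    private static final int PE_OFFSET_FIELD = 0x3C;

    /** True if the magic bytes appear in data at exactly the given offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (offset < 0 || data.length - offset < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Requires the MZ header at offset 0 AND the PE signature where the header points. */
    static boolean looksLikePe(byte[] data) {
        if (!matchesAt(data, 0, MZ) || data.length < PE_OFFSET_FIELD + 4) {
            return false;
        }
        int peOffset = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(PE_OFFSET_FIELD);
        return matchesAt(data, peOffset, PE_SIGNATURE);
    }

    static boolean looksLikeZip(byte[] data) {
        return matchesAt(data, 0, ZIP_LOCAL_HEADER);
    }

    public static void main(String[] args) {
        // Synthetic ZIP-like buffer that happens to contain "PE\0\0" further in.
        byte[] jarLike = new byte[256];
        System.arraycopy(ZIP_LOCAL_HEADER, 0, jarLike, 0, ZIP_LOCAL_HEADER.length);
        System.arraycopy(PE_SIGNATURE, 0, jarLike, 128, PE_SIGNATURE.length);

        // Synthetic minimal PE-like buffer: MZ header pointing at a PE signature.
        byte[] peLike = new byte[256];
        System.arraycopy(MZ, 0, peLike, 0, MZ.length);
        peLike[PE_OFFSET_FIELD] = (byte) 0x80;
        System.arraycopy(PE_SIGNATURE, 0, peLike, 0x80, PE_SIGNATURE.length);

        System.out.println("jarLike: pe=" + looksLikePe(jarLike) + " zip=" + looksLikeZip(jarLike));
        System.out.println("peLike:  pe=" + looksLikePe(peLike) + " zip=" + looksLikeZip(peLike));
    }
}
```

Requiring "MZ" at the start means a ZIP container that merely contains the PE signature bytes somewhere inside no longer qualifies as a PE file.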



[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188892#comment-14188892
 ] 

Hudson commented on TIKA-1461:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #290 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/290/])
TIKA-1461 PE files must also have the MZ header at the start, so tweak magic 
and add positive and negative mime magic detection tests for it (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1635263)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJAR_with_PEHDR.jar
Very small Windows exe for TIKA-1461, generated with Visual Studio 2008 with 
advice from http://www.phreedom.org/research/tinype/ (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1635257)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testTinyPE.exe




[jira] [Commented] (TIKA-1461) Bad mime detection of certain JAR file

2014-10-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188912#comment-14188912
 ] 

Hudson commented on TIKA-1461:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #270 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/270/])
TIKA-1461 PE files must also have the MZ header at the start, so tweak magic 
and add positive and negative mime magic detection tests for it (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1635263)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJAR_with_PEHDR.jar
Very small Windows exe for TIKA-1461, generated with Visual Studio 2008 with 
advice from http://www.phreedom.org/research/tinype/ (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1635257)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testTinyPE.exe




PDF test failing on trunk

2014-10-29 Thread Nick Burch

Hi All

Just tried to build trunk, and got a test failure:

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content


Tests run: 547, Failures: 0, Errors: 1, Skipped: 7


The exception in the log is:

Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
    at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
    at javax.crypto.CipherInputStream.read(CipherInputStream.java:236)
    at javax.crypto.CipherInputStream.read(CipherInputStream.java:212)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:316)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:421)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:365)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:196)
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1595)

Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
    at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
    at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
    at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:423)
    at javax.crypto.Cipher.doFinal(Cipher.java:1708)
    at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
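
For reference, that message is what the JDK's AES implementation raises when a padded cipher is asked to decrypt input that is not a whole number of 16-byte blocks. A minimal sketch that triggers the same exception type outside PDFBox follows. The all-zero key and IV, the CBC mode, and the class name are assumptions chosen for the demo and say nothing about how the failing PDF is encrypted.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Demo of the AES block-size check; key and IV are dummy values. */
final class AesBlockSizeSketch {
    private static final byte[] KEY = new byte[16]; // dummy all-zero 128-bit key
    private static final byte[] IV = new byte[16];  // dummy all-zero IV

    private static Cipher cipher(int mode) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(mode, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        return c;
    }

    /** Encrypts with padding, so the output is always a multiple of 16 bytes. */
    static byte[] encrypt(byte[] plain) {
        try {
            return cipher(Cipher.ENCRYPT_MODE).doFinal(plain);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns "ok" on success, otherwise the simple name of the exception thrown. */
    static String decryptOutcome(byte[] ciphertext) {
        try {
            cipher(Cipher.DECRYPT_MODE).doFinal(ciphertext);
            return "ok";
        } catch (GeneralSecurityException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] whole = encrypt("hello".getBytes(StandardCharsets.UTF_8));
        // Dropping a single byte leaves a partial block behind.
        byte[] truncated = Arrays.copyOf(whole, whole.length - 1);
        System.out.println(whole.length + " bytes: " + decryptOutcome(whole));
        System.out.println(truncated.length + " bytes: " + decryptOutcome(truncated));
    }
}
```

So the failure suggests the encrypted stream reaching the cipher has a length that is not a multiple of 16, whether because the data is truncated or because its length is computed wrongly.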



Is anyone else seeing this one?

Nick


[jira] [Created] (TIKA-1462) PDFont consumes all heap space

2014-10-29 Thread James Hardwick (JIRA)
James Hardwick created TIKA-1462:


 Summary: PDFont consumes all heap space
 Key: TIKA-1462
 URL: https://issues.apache.org/jira/browse/TIKA-1462
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: James Hardwick
Priority: Critical


See https://issues.apache.org/jira/browse/PDFBOX-2200 for more details.

In short, PDFont will not release resources, and will eventually amass enough 
objects to consume all available memory. We are encountering this in 
production environments, causing our Solr server to crash when ingesting large 
numbers of PDF documents. 

The fix is supposedly in for the 2.0.0 release of PDFBox, but that version has 
been outstanding for so long that I'd suggest implementing the workaround as 
proposed in the PDFBox issue. 





[jira] [Closed] (TIKA-1462) PDFont consumes all heap space

2014-10-29 Thread James Hardwick (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Hardwick closed TIKA-1462.

Resolution: Duplicate

Looks like this was already handled via TIKA-1424
