[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-04 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890560#comment-13890560
 ] 

Markus Jelsma commented on TIKA-1224:
-

A patch seems to be missing here.

> Adding Source code (Java, Groovy, C) parser
> ---
>
> Key: TIKA-1224
> URL: https://issues.apache.org/jira/browse/TIKA-1224
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For HTML rendering of code, we can use jhighlight: 
> http://www.ohloh.net/p/jhighlight



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (TIKA-1229) Hyperlink in .doc page header broken

2014-02-04 Thread Lutz Theurer (JIRA)
Lutz Theurer created TIKA-1229:
--

         Summary: Hyperlink in .doc page header broken
             Key: TIKA-1229
             URL: https://issues.apache.org/jira/browse/TIKA-1229
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.4
        Reporter: Lutz Theurer
        Priority: Minor


If you have a hyperlink to a web page or a mailto address in the page header 
(German: Kopfzeile) of a .doc document, the extracted text is garbled like this:
 �HYPERLINK "http://tw-systemhaus.de"; �http://tw-systemhaus.de�

However, this is not an issue for hyperlinks in the body text.





[jira] [Commented] (TIKA-1229) Hyperlink in .doc page header broken

2014-02-04 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890564#comment-13890564
 ] 

Nick Burch commented on TIKA-1229:
--

Any chance you could upload a file that shows the issue? Bonus marks if you 
could write a small junit unit test that uses it to show the problem!

> Hyperlink in .doc page header broken
> 
>
> Key: TIKA-1229
> URL: https://issues.apache.org/jira/browse/TIKA-1229
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Lutz Theurer
>Priority: Minor
>
> If you have a hyperlink to a webpage or mailto in the page header (german: 
> Kopfzeile) of your .doc document the import is defaced like this:
>  �HYPERLINK "http://tw-systemhaus.de"; �http://tw-systemhaus.de�
> It's however not an issue in text.





[jira] [Updated] (TIKA-1229) Hyperlink in .doc page header broken

2014-02-04 Thread Lutz Theurer (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lutz Theurer updated TIKA-1229:
---

Attachment: mail.doc

> Hyperlink in .doc page header broken
> 
>
> Key: TIKA-1229
> URL: https://issues.apache.org/jira/browse/TIKA-1229
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Lutz Theurer
>Priority: Minor
> Attachments: mail.doc
>
>
> If you have a hyperlink to a webpage or mailto in the page header (german: 
> Kopfzeile) of your .doc document the import is defaced like this:
>  �HYPERLINK "http://tw-systemhaus.de"; �http://tw-systemhaus.de�
> It's however not an issue in text.





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890605#comment-13890605
 ] 

Tim Allison commented on TIKA-1228:
---

Not sure I understand.  Is this the snippet that you refer to in PDNameTreeNode:
{noformat}
public Map getNames() throws IOException
{
    COSArray namesArray = (COSArray)node.getDictionaryObject( COSName.NAMES );
{noformat}

The above throws a class cast exception, but the code that you show doesn't?

Are you getting a class cast exception on the document that you submitted with 
this issue or is it a different document?

Thank you, again.

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.
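
A minimal sketch of the traversal the description asks for, assuming a simplified name tree in which a node may carry a Names entry, a Kids entry, both, or neither. NameNode and NameTreeWalker are hypothetical stand-ins written for illustration only; they are not PDFBox or Tika classes, and the String values stand in for embedded file specifications.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF name-tree node; not a PDFBox class.
final class NameNode {
    final Map<String, String> names; // null when the node has no Names entry
    final List<NameNode> kids;       // null when the node has no Kids entry

    NameNode(Map<String, String> names, List<NameNode> kids) {
        this.names = names;
        this.kids = kids;
    }
}

final class NameTreeWalker {
    // Collects name -> value pairs from the node and, recursively, from its kids.
    static Map<String, String> collect(NameNode node) {
        Map<String, String> out = new LinkedHashMap<>();
        collectInto(node, out);
        return out;
    }

    private static void collectInto(NameNode node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.names != null) {
            out.putAll(node.names);
        }
        // Descend into Kids whether or not this node had Names of its own.
        if (node.kids != null) {
            for (NameNode kid : node.kids) {
                collectInto(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        NameNode first = new NameNode(Collections.singletonMap("a.doc", "stream-1"), null);
        NameNode second = new NameNode(Collections.singletonMap("b.txt", "stream-2"), null);
        // The root has no Names entry at all, only Kids.
        NameNode root = new NameNode(null, Arrays.asList(first, second));
        System.out.println(collect(root).keySet());
    }
}
```

The point of the sketch is only that a missing Names entry on the root must not end the search; a real implementation would also want to guard against malformed trees that loop back on themselves.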





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890607#comment-13890607
 ] 

Jason Sherman commented on TIKA-1228:
-

Tim,

I saw you already added a test and fix to the codebase.  Thanks!  I'm going to 
clone it and use it if you don't mind. 

Jason

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610
 ] 

Tim Allison commented on TIKA-1228:
---

Y.  That's the point of open source. :)  Enjoy!

Now that I'm looking at this issue again, I dragged out some of my pre-Tika 
code for pdf attachments using a different pdf library.  It looks like the pdf 
files I was coding against could have the file name in a parent node and the 
actual bytes in a child or more distant descendant node.

Will see if I can dig up the triggering files and see if Tika needs any more 
mods on PDF attachment extraction.

{noformat}
private MyPDFAttachment lookForByteStream(COSDictionary dict,
        MyPDFAttachment attach, int recursiveDepth){

    COSName fCOSName = COSName.create("F");
    COSName efCOSName = COSName.create("EF");
    COSObject fObj = dict.get(fCOSName);
    COSObject efObj = dict.get(efCOSName);
    if (null != fObj){
        if (fObj.getClass() == COSString.class){
            attach.setName(fObj.stringValue());
        } else if (fObj.getClass() == COSStream.class){
            attach.setBytes(((COSStream)fObj).getDecodedBytes());
            return attach;
        }
    }
    if (null != efObj && efObj.getClass() == COSDictionary.class){
        int tmpI = recursiveDepth;
        tmpI++;
        return lookForByteStream((COSDictionary)efObj, attach, tmpI);
    }
    return null;
}
{noformat}

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890608#comment-13890608
 ] 

Jason Sherman commented on TIKA-1228:
-

Tim,

Dang.  During my troubleshooting, I first updated pdfbox to 1.8.3 and was using 
that source to step through the code.  After the weirdness with the exception 
in code, but not in my expression evaluator, I reverted to the original tika 
code, but failed to revert the pdfbox code.  I apologize for the confusion.  
Thanks again for your fast responses.

Jason

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890613#comment-13890613
 ] 

Tim Allison commented on TIKA-1228:
---

Ok, to confirm, the PDNameTreeNode class cast exception is a non-issue?

Thanks again.

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.





[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890639#comment-13890639
 ] 

Jason Sherman edited comment on TIKA-1228 at 2/4/14 1:11 PM:
-

Correct.  PDNameTreeNode class cast exception is a non-issue.


was (Author: agi20dla):
Correct.  PDNameTreeNode clas cast exception is a non-issue.

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890639#comment-13890639
 ] 

Jason Sherman commented on TIKA-1228:
-

Correct.  PDNameTreeNode clas cast exception is a non-issue.

> Embedded files not extracted properly from PDF
> --
>
> Key: TIKA-1228
> URL: https://issues.apache.org/jira/browse/TIKA-1228
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: CentOS 6.5 VM
>Reporter: Jason Sherman
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.





[jira] [Created] (TIKA-1230) Update PDFBox to v1.8.4

2014-02-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1230:
-

         Summary: Update PDFBox to v1.8.4
             Key: TIKA-1230
             URL: https://issues.apache.org/jira/browse/TIKA-1230
         Project: Tika
      Issue Type: Improvement
Affects Versions: 1.5
        Reporter: Tim Allison
        Priority: Trivial
         Fix For: 1.5








[jira] [Resolved] (TIKA-1230) Update PDFBox to v1.8.4

2014-02-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1230.
---

Resolution: Fixed

r1564335

> Update PDFBox to v1.8.4
> ---
>
> Key: TIKA-1230
> URL: https://issues.apache.org/jira/browse/TIKA-1230
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.5
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.5
>
>






[jira] [Created] (TIKA-1231) Safely handle null embedded files in PDFs

2014-02-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1231:
-

         Summary: Safely handle null embedded files in PDFs
             Key: TIKA-1231
             URL: https://issues.apache.org/jira/browse/TIKA-1231
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.5
        Reporter: Tim Allison
        Assignee: Tim Allison
        Priority: Minor
         Fix For: 1.5


I filed a potential fix, unit test and test doc for this in PDFBOX-1884.  We'll 
need to add one test for null in the Tika PDFParser to handle this change once 
it is fixed in PDFBox.
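
A rough sketch of the kind of null test mentioned above, assuming the embedded files arrive as a map from name to content in which a value may be null. NullSafeEmbedded and its method are hypothetical names for illustration; this is not the PDFParser code and does not use the PDFBox API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NullSafeEmbedded {
    // Returns the names of entries that carry content; null values are skipped.
    static List<String> extractable(Map<String, byte[]> embedded) {
        List<String> names = new ArrayList<>();
        if (embedded == null) {
            return names;
        }
        for (Map.Entry<String, byte[]> entry : embedded.entrySet()) {
            byte[] data = entry.getValue();
            if (data == null) {
                continue; // a name is present but there is no file behind it
            }
            names.add(entry.getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, byte[]> embedded = new LinkedHashMap<>();
        embedded.put("present.txt", new byte[] {1, 2, 3});
        embedded.put("missing.doc", null);
        System.out.println(extractable(embedded));
    }
}
```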





Key Revocation

2014-02-04 Thread David Meikle
Hello,
(CC dev@tika.apache.org for info)

I have had to revoke my code signing key due to media failure.

Attached is the revocation key and I am following the steps here:
http://www.apache.org/dev/release-signing.html#revoke-cert

Cheers,
Dave

-BEGIN PGP PUBLIC KEY BLOCK-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: A revocation certificate should follow

iQIfBCABAgAJBQJLrjSDAh0CAAoJEGBlJ+auqMarJnsP/j6E+hQ9vkdvMncXbQXX
9auQeI0tRQDvgoKmQYk9T16QyhXANZoJDzuLEmE/8kMNwr0U5ay3lEV0KsJZe+z8
fnsEmfNoV6xACNwa/DT4V4dQnVvy++K9z8CndX/3QNimduvuKzCZhEbMltwdQSYh
Lr7JUWBerayMD3XR8Jl+bYnU47FapI4pgDNOKbLsdhEPhlEcZUhqBy/8d/p/NjI6
NLzOpBieSbYhYYh5O8wjR0JJw5gtYf//IO7GQYoAGzbGLa9m/MCsNSNU9M3HYQs0
dnBYxS3yGOk+rIRrfb/MR6ySjfSLiNb6IvSrSQ2oaJfjm3NnBYD4z1QcAq+JmfFm
b0Zq7uAEaxrovzkFX3mwNmxeoMTqJP0nr6napU5y7maJYzsLO/tHf+NeIiZIuoK8
Q/GQE1QJW3nQ2/LOWwAyXTLoMR6/IP+HO9Lk6ad9hXPswqPKw6vGTJrMJJS+eyTR
AFoxY8nQ24rkE6uT37hZUPeoZdZ6sGDIJBDWUXTez/U/TgCFrbpoMhMC2oLOWdA0
Qr/vgz6wdb/8x/CUe49eKzdXhIacAI7aYXtLOQwVprARDTUqSrfB9ijwz80BkrL1
OWqgI1iHcsQD0e8eqzVQCndehvLMhezRQ2NSQyZNF0AsBP39OtyK5ATUNrrq6g8L
raF/l9VwsC/63dtfJyvq0sLE
=Q/66
-END PGP PUBLIC KEY BLOCK-



[jira] [Resolved] (TIKA-1229) Hyperlink in .doc page header broken

2014-02-04 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1229.
--

   Resolution: Fixed
Fix Version/s: 1.5

Should be fixed as of r1564540. Ended up being a bit more work than first 
anticipated, as we were processing headers and footers in a very simplistic 
way, which has now been replaced with handling them as proper ranges
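
For background on the garbled output in the report: the Word binary format brackets a field with three control characters, 0x13 (field begin), 0x14 (separator between the field instruction and its result) and 0x15 (field end). Copying header text through unchanged therefore leaks the HYPERLINK instruction together with those markers. The sketch below only illustrates that idea on a plain string; FieldCodeStripper is a hypothetical helper, not the change committed in r1564540, which works on ranges rather than raw strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldCodeStripper {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drops field instructions and field markers, keeping only field results
    // and ordinary text.
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still in its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int instructionDepth = 0; // number of TRUE entries currently on the stack
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                instructionDepth++;
            } else if (c == FIELD_SEPARATOR) {
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    instructionDepth--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && openFields.pop()) {
                    instructionDepth--; // field had no result part
                }
            } else if (instructionDepth == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = FIELD_BEGIN + "HYPERLINK \"http://tw-systemhaus.de\" "
                + FIELD_SEPARATOR + "http://tw-systemhaus.de" + FIELD_END;
        System.out.println(visibleText(raw));
    }
}
```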

> Hyperlink in .doc page header broken
> 
>
> Key: TIKA-1229
> URL: https://issues.apache.org/jira/browse/TIKA-1229
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Lutz Theurer
>Priority: Minor
> Fix For: 1.5
>
> Attachments: mail.doc
>
>
> If you have a hyperlink to a webpage or mailto in the page header (german: 
> Kopfzeile) of your .doc document the import is defaced like this:
>  �HYPERLINK "http://tw-systemhaus.de"; �http://tw-systemhaus.de�
> It's however not an issue in text.





[jira] [Resolved] (TIKA-1086) Tika-bundle 1.3 does not import org.w3c.dom package

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1086.
---

Resolution: Fixed

Patch added in r1553845.

> Tika-bundle 1.3 does not import org.w3c.dom package
> ---
>
> Key: TIKA-1086
> URL: https://issues.apache.org/jira/browse/TIKA-1086
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Gaurav
> Fix For: 1.5
>
> Attachments: TIKA-1086.svn.diff
>
>
> The tika-bundle 1.3 version does not import the org.w3c.dom package; as a 
> result, it is not able to parse DOM-based documents such as Microsoft Word 
> (docx) documents.
> This issue does not occur in version 1.2, which does import the necessary 
> package, so parsing of these documents works fine.
> Can someone please look into the issue, as Microsoft Word is a very popular 
> document format.





[jira] [Updated] (TIKA-605) Tika GDAL parser

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-605:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Tika GDAL parser
> 
>
> Key: TIKA-605
> URL: https://issues.apache.org/jira/browse/TIKA-605
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
> Environment: indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gdal, gsoc2013, integration, mentor, tika
> Fix For: 1.6
>
> Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
> TIKA-605.Mattmann.092511.patch.txt
>
>
> Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
> around GDAL. See here: 
> http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-1167) Embedded object not extracted

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1167:
--

Fix Version/s: (was: 1.5)
   1.6

Push to 1.6, preparing for 1.5 RC

> Embedded object not extracted
> -
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.6
>
> Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, tika seems to detect the embedded object, as shown by 
> this tag:
> {{}}
> However, extraction itself (using -z on the command line, or using the API) 
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
> /tmp/tika/rId9_image1.wmf}}





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-539:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
> Fix For: 1.6
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can come 
> from the HTTP response header).
> Test code to reproduce:
> static String content = "\n"
>     + " content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>     + "Über den Wolken\n";
> /**
>  * @param args
>  * @throws IOException
>  * @throws TikaException
>  * @throws SAXException
>  */
> public static void main(String[] args) throws IOException, SAXException,
>         TikaException {
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.CONTENT_TYPE, "text/html");
>     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>     AutoDetectParser parser = new AutoDetectParser();
>     BodyContentHandler h = new BodyContentHandler(1);
>     parser.parse(in, h, metadata, new ParseContext());
>     System.out.print(h.toString());
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
> }
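
To make the expected behaviour concrete, here is a small sketch of one possible precedence: a charset supplied from outside (for example from an HTTP response header) wins when it names a supported charset, the meta tag is consulted only otherwise, and UTF-8 is the last resort. CharsetChooser, the regular expression and the UTF-8 default are assumptions made for illustration; this is not how Tika's detector is implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetChooser {
    // Deliberately loose pattern: picks up charset=... inside a meta tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the named charset, or null if the name is missing, illegal or unsupported.
    static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            String trimmed = name.trim();
            return Charset.isSupported(trimmed) ? Charset.forName(trimmed) : null;
        } catch (IllegalArgumentException e) {
            return null; // illegal charset name
        }
    }

    // Prefers the external hint, then the meta tag, then UTF-8.
    static Charset choose(String externalHint, String htmlPrefix) {
        Charset fromHint = lookup(externalHint);
        if (fromHint != null) {
            return fromHint;
        }
        if (htmlPrefix != null) {
            Matcher m = META_CHARSET.matcher(htmlPrefix);
            if (m.find()) {
                Charset fromMeta = lookup(m.group(1));
                if (fromMeta != null) {
                    return fromMeta;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // Sample input only; the meta tag claims ISO-8859-1.
        String html = "<meta http-equiv=\"content-type\" "
                + "content=\"application/xhtml+xml; charset=iso-8859-1\" />";
        System.out.println(choose("UTF-8", html)); // the hint is honoured
        System.out.println(choose(null, html));    // falls back to the meta tag
    }
}
```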





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-715:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
> Fix For: 1.6
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
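
The matched-tag check described above can be sketched with a simple stack. TagBalanceChecker is a hypothetical class written for illustration; it is not the code in SafeContentHandler, although its error messages are modelled on the assertion messages quoted in the failures listed for this issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TagBalanceChecker {
    private final Deque<String> open = new ArrayDeque<>();

    // Records an opened element.
    void startElement(String name) {
        open.push(name);
    }

    // Verifies that the element being closed is the most recently opened one.
    void endElement(String name) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + name + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(name)) {
            throw new IllegalStateException(
                    "mismatched elements open=" + expected + " close=" + name);
        }
    }

    // Verifies that nothing is left open at the end of the document.
    void endDocument() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed element=" + open.peek());
        }
    }

    public static void main(String[] args) {
        TagBalanceChecker checker = new TagBalanceChecker();
        checker.startElement("table");
        checker.startElement("tr");
        try {
            checker.endElement("table"); // closes the table without closing the tr
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```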
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.m

[jira] [Updated] (TIKA-776) ExifTool Embedder

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-776:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> ExifTool Embedder
> -
>
> Key: TIKA-776
> URL: https://issues.apache.org/jira/browse/TIKA-776
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata
>Affects Versions: 1.0
> Environment: ExifTool is required 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: embed, exiftool, patch
> Fix For: 1.6
>
> Attachments: tika-parsers-exiftool-embed-patch.txt
>
>
> This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
> issues TIKA-774 and TIKA-775.
> In tika-parsers, an ExiftoolExternalEmbedder is added which extends 
> ExternalEmbedder to programmatically create an Embedder that calls the 
> ExifTool command line to embed Tika metadata into a file stream. An 
> ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC 
> and XMP fields, then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-819:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Make Option to Exclude Embedded Files' Text for Text Content
> 
>
> Key: TIKA-819
> URL: https://issues.apache.org/jira/browse/TIKA-819
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows-7 + JDK 1.6 u26
>Reporter: Albert L.
> Fix For: 1.6
>
>
> It would be nice to be able to disable text content from embedded files.
> For example, if I have a DOCX with an embedded PPTX, then I would like the 
> option to disable text from the PPTX from showing up when asking for the text 
> content from DOCX.  In other words, it would be nice to have the option to 
> get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-985) Support for HTML5 elements

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-985:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Support for HTML5 elements
> --
>
> Key: TIKA-985
> URL: https://issues.apache.org/jira/browse/TIKA-985
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.6
>
> Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
> section). This prevents some custom ContentHandlers from reading expected 
> elements and/or attributes.





[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-995:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> XHTMLContentHandler doesn't pass attributes of body element
> ---
>
> Key: TIKA-995
> URL: https://issues.apache.org/jira/browse/TIKA-995
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.6
>
> Attachments: TIKA-995-1.3-1.patch, TIKA-995-unit.patch
>
>
> XHTMLContentHandler.startElement() uses lazyHead() for the body element 
> because it's defined in the AUTO Set. As a consequence, attributes of the 
> body element are not passed to downstream content handlers. 





[jira] [Updated] (TIKA-774) ExifTool Parser

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-774:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, newbie, patch,
> Fix For: 1.6
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.





[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1059:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
> --
>
> Key: TIKA-1059
> URL: https://issues.apache.org/jira/browse/TIKA-1059
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Ray Gauss II
> Fix For: 1.6
>
>
> The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
> {{InterruptedException}} and ignore it.
> The methods should either call {{interrupt()}} on the current thread or 
> re-throw the exception, possibly wrapped in a {{TikaException}}.
> See TIKA-775 for a previous discussion.
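
The two options named in the description can be sketched as follows. This is a minimal illustration, not the ExternalParser code: SketchTikaException, BlockingStep and both method names are hypothetical stand-ins invented here so that the example compiles on its own.

```java
/** Hypothetical stand-in for TikaException, so the sketch is self-contained. */
class SketchTikaException extends Exception {
    SketchTikaException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class InterruptHandlingSketch {

    /** Hypothetical blocking step, e.g. waiting for an external process to exit. */
    interface BlockingStep {
        void run() throws InterruptedException;
    }

    /** Option 1: do not propagate, but restore the thread's interrupt flag. */
    static boolean runRestoringFlag(BlockingStep step) {
        try {
            step.run();
            return true;
        } catch (InterruptedException e) {
            // Re-set the flag so callers further up can still see the interruption
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Option 2: re-throw, wrapped in an exception type the caller already handles. */
    static void runRethrowing(BlockingStep step) throws SketchTikaException {
        try {
            step.run();
        } catch (InterruptedException e) {
            // Restoring the flag here as well is an extra precaution, not required by the issue
            Thread.currentThread().interrupt();
            throw new SketchTikaException("Interrupted while waiting for external process", e);
        }
    }
}
```

Either variant avoids silently discarding the interruption; which one fits depends on whether callers expect a checked exception.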





[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1079:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Word document hits AIOOBE in SummaryExtractor.parseSummaries
> 
>
> Key: TIKA-1079
> URL: https://issues.apache.org/jira/browse/TIKA-1079
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.6
>
> Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc
>
>
> I'm not yet sure if this is a corrupted document (though, MS Word opens it 
> just fine) or a bug in POI ... but I hit this exc when running it through 
> TikaCLI:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: -1
>   at org.apache.poi.hpsf.CodePageString.<init>(CodePageString.java:161)
>   at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
>   at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
>   at org.apache.poi.hpsf.Property.<init>(Property.java:164)
>   at org.apache.poi.hpsf.Section.<init>(Section.java:277)
>   at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
>   at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
>   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
>   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> {noformat}





[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1072:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> AIOOBE when handling embedded document in .doc file
> ---
>
> Key: TIKA-1072
> URL: https://issues.apache.org/jira/browse/TIKA-1072
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.6
>
> Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
> /x/tmp/20-Force-on-a-current-S00.doc 
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
>   at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
>   at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
>   at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object ... the code
> that does this parsing captures and ignores Ole10NativeException and
> skips the entry ... so I'm wondering if we should also catch AIOOBE
> and skip the entry?  Ie, maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?
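
The suggestion above (treat an ArrayIndexOutOfBoundsException like the already-ignored Ole10NativeException and skip the entry) might look roughly like this. It is only a sketch under stated assumptions: SketchOle10NativeException, readHeaderShort and parseAll are hypothetical names made up for illustration and do not reproduce the POI or Tika code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for POI's Ole10NativeException. */
class SketchOle10NativeException extends Exception {
    SketchOle10NativeException(String message) {
        super(message);
    }
}

final class EmbeddedEntrySketch {

    /** Hypothetical header read: a little-endian short taken from the first two bytes. */
    static int readHeaderShort(byte[] data) throws SketchOle10NativeException {
        if (data.length == 0) {
            throw new SketchOle10NativeException("empty entry");
        }
        // On truncated data, data[1] throws ArrayIndexOutOfBoundsException,
        // mirroring the failure mode in the stack trace above
        return (data[0] & 0xFF) | ((data[1] & 0xFF) << 8);
    }

    /** Parses every entry, skipping those that fail for either reason. */
    static List<Integer> parseAll(List<byte[]> entries) {
        List<Integer> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(readHeaderShort(entry));
            } catch (SketchOle10NativeException | ArrayIndexOutOfBoundsException e) {
                // Not a usable OLE10 entry: skip it and continue with the next one
            }
        }
        return parsed;
    }
}
```

Catching the unchecked exception at the call site is a workaround; if the entry really is not OLE10, the cleaner long-term fix would be for the parsing code to report that through its own exception type.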





[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1108:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Represent individual slides in pptx
> ---
>
> Key: TIKA-1108
> URL: https://issues.apache.org/jira/browse/TIKA-1108
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.6
>
>
> When parsing ppt, tika produces for each slide:
> 
> However for pptx these seem to be missing, all the text is directly under 
> .





[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-987:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
> 
>
> Key: TIKA-987
> URL: https://issues.apache.org/jira/browse/TIKA-987
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.6
>
> Attachments: picture.doc, picture_3.doc
>
>
> I have two Word docs, both containing the same drawing, but one has
> text added.
> In one case (picture.doc) the extraction is correct: it contains only
> an embedded image.wmf; when I view the image it's correct.
> In the second case (picture_3.doc) the picture is extracted as image
> (no extension), and is 0 bytes, and there is an invalid character
> (mapped to unicode replacement char) inserted before the image:
> {noformat}
> 
> 
> �
> 
> 
> vehicle
> 
> {noformat}
> (Though, the text "vehicle" is extracted correctly).
> I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
> MERGEFORMAT} field, which we invoke
> WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
> the 0-byte no-extension image as well as the invalid character.  With
> the first doc there is no field (at least not one that's handled with
> handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
> fix... it could be something is going wrong in how POI parses the
> Pictures from PictureSource.





[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-988:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> We don't extract a placeholder for a Word document embedded in an Excel 
> document
> 
>
> Key: TIKA-988
> URL: https://issues.apache.org/jira/browse/TIKA-988
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.6
>
> Attachments: bug31373.xls
>
>
> In TIKA-956 we fixed the Word parser so that at the point where an embedded 
> document appears, we output a  tag.
> It would be nice to do this for documents embedded in Excel too.





[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-980:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> MicrodataContentHandler for Apache Tika
> ---
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Ken Krugler
> Fix For: 1.6
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure 
> containing Microdata item scopes and item properties. The Item* classes are 
> borrowed from the Apache Any23 project and are slightly modified to 
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA 
> ApacheCon events and each has a nested property.





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1106:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: Wish
>  Components: general
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Priority: Minor
>  Labels: entity, geospatial
> Fix For: 1.6
>
>
> I've been evaluating CLAVIN as a way to extract location information from 
> unstructured text. It seems like meshing it with Tika in some way would make 
> a lot of sense. From CLAVIN website...
> {quote}
> CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
> software package for document geotagging and geoparsing that employs 
> context-based geographic entity resolution. It combines a variety of open 
> source tools with natural language processing techniques to extract location 
> names from unstructured text documents and resolve them against gazetteer 
> records. Importantly, CLAVIN does not simply "look up" location names; 
> rather, it uses intelligent heuristics in an attempt to identify precisely 
> which "Springfield" (for example) was intended by the author, based on the 
> context of the document. CLAVIN also employs fuzzy search to handle 
> incorrectly-spelled location names, and it recognizes alternative names 
> (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic 
> entity. By enriching text documents with structured geo data, CLAVIN enables 
> hierarchical geospatial search and advanced geospatial analytics on 
> unstructured data.
> {quote}
> There was only one other instance of the word "clavin" mentioned in the ASF 
> jira site so I thought it was definitely worth posting here.
> https://github.com/Berico-Technologies/CLAVIN





[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1208:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Migrate Any23 mime contributions to Tika
> 
>
> Key: TIKA-1208
> URL: https://issues.apache.org/jira/browse/TIKA-1208
> Project: Tika
>  Issue Type: Sub-task
>  Components: mime
>Reporter: Lewis John McGibbney
> Fix For: 1.6
>
> Attachments: TIKA-1208.patch
>
>
> We begin with one of the most obvious areas in which there
> is overlap.
> In short, the appeal of this package is the addition of detection 
> for the following types:
>  - text/n3
>  - text/rdf+n3
>  - application/n3
>  - text/x-nquads
>  - text/rdf+nq
>  - text/nq
>  - application/nq
>  - text/turtle
>  - application/x-turtle
>  - application/turtle
>  - application/trix
>  
> Therefore, although both Tika and Any23 perform Mimetype-related
> tasks, there is a contribution to be made. This involves the transferral of
> code pertaining to pattern recognition, Mimetype XML definitions within 
> tika-mimetypes.xml and a Purifier implementation that removes any 
> blank characters at the head of a file that might 
> prevent its MIME type detection.  
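
As an illustration of the Purifier idea described above, the sketch below strips leading blank bytes so that magic-pattern matching starts at the first meaningful byte. The class and method names are hypothetical, and the choice of which bytes count as blank (space, tab, CR, LF) is an assumption; the Any23 implementation may differ.

```java
import java.util.Arrays;

final class LeadingBlankPurifierSketch {

    /** Assumption for this sketch: space, tab, carriage return and line feed count as blank. */
    private static boolean isBlank(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }

    /** Returns a copy of the content with any leading blank bytes removed. */
    static byte[] purify(byte[] content) {
        int start = 0;
        while (start < content.length && isBlank(content[start])) {
            start++;
        }
        return Arrays.copyOfRange(content, start, content.length);
    }
}
```

A detector that expects, say, an `@prefix` directive at offset 0 of a Turtle file would then also match files that begin with a few empty lines.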





[jira] [Updated] (TIKA-1220) Parser implementation for IFC files

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1220:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Parser implementation for IFC files
> 
>
> Key: TIKA-1220
> URL: https://issues.apache.org/jira/browse/TIKA-1220
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6
>
> Attachments: 2012-03-23-Duplex-Programming.ifc
>
>
> The Industry Foundation Classes (IFC) [0] data model is intended to describe 
> building and construction industry data. For the sake of argument, it can be 
> considered as a more intelligent successor to the .dwg data models used 
> within CAD models.
> I've tracked down a potential 3rd party library [1] which we may be able to 
> wrap and use within Tika however the provided software packages are licensed 
> under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently 
> over on legal-discuss@ in an attempt to see if it is possible to wrap some 
> code and contribute it to tika-parsers.
> When I get feedback from legal-discuss, and if this is a go-ahead, I'll need 
> to help the developers package the code as a Maven artifact(s), then I will 
> progress with writing the implementation.  
> [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
> [1] http://www.ifctoolsproject.com/





[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-891:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.6
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Updated] (TIKA-1231) Safely handle null embedded files in PDFs

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1231:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Safely handle null embedded files in PDFs
> -
>
> Key: TIKA-1231
> URL: https://issues.apache.org/jira/browse/TIKA-1231
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.6
>
>
> I filed a potential fix, unit test and test doc for this in PDFBOX-1884.  
> We'll need to add one test for null in the Tika PDFParser to handle this 
> change once it is fixed in PDFBox.





[jira] [Resolved] (TIKA-973) PDF form data isn't included in extracted content.

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-973.
--

Resolution: Fixed

> PDF form data isn't included in extracted content.
> --
>
> Key: TIKA-973
> URL: https://issues.apache.org/jira/browse/TIKA-973
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.2
>Reporter: Michael Graessle
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.5
>
> Attachments: TIKA-973-patch.tar.gz, TIKA-973.patch.tar.gz, 
> i-9_screenshot.png
>
>
> When extracting content from PDFs, PDF form data isn't extracted. 
> The following code extracts this data via PDF box, but it seems like 
> something Tika should be doing.
> PDDocumentCatalog docCatalog = load.getDocumentCatalog();
> if (docCatalog != null) {
>   PDAcroForm acroForm = docCatalog.getAcroForm();
>   if (acroForm != null) {
>     // The field list is treated as List<PDField>; the unchecked warning is suppressed
>     @SuppressWarnings("unchecked")
>     List<PDField> fields = acroForm.getFields();
>     if (fields != null && !fields.isEmpty()) {
>       documentContent.append(" ");
>       for (PDField field : fields) {
>         // Append each non-null field value, separated by spaces
>         if (field.getValue() != null) {
>           documentContent.append(field.getValue());
>           documentContent.append(" ");
>         }
>       }
>     }
>   }
> }





[jira] [Updated] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1223:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Extract thumbnail of OOXML Office files
> ---
>
> Key: TIKA-1223
> URL: https://issues.apache.org/jira/browse/TIKA-1223
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
>Priority: Minor
> Fix For: 1.6
>
> Attachments: TIKA-1223.patch
>
>
> From Microsoft Office 2007 file formats, thumbnail could be included in 
> package. We can extract this embedded thumbnail for OOXML files.
> As discussed on the mailing list, we should extract the thumbnail as an attachment, 
> not as metadata (TIKA-90).
> {noformat}
> embeddedRelationId format is thumbnail_{i}.{extension}.
> {noformat}





[jira] [Updated] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1205:
--

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

> Allow PDFParser to fallback to other parser if there is an exception
> 
>
> Key: TIKA-1205
> URL: https://issues.apache.org/jira/browse/TIKA-1205
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.6
>
>
> With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
> instead of the traditional parser for parsing PDF files.  Following the 
> description in PDFBOX-1199, it would be useful to allow fallback to the 
> classic parser if NonSequentialPDFParser throws an IOException.  For the sake 
> of symmetry, I propose a boolean useParserFallbackOnException parameter.  If 
> this parameter is true, and if Tika's PDFParser is using the classic parser, 
> Tika will fall back to the NonSequentialPDFParser if there is an IOException; 
> if this parameter is true and if Tika's PDFParser is using the 
> NonSequentialPDFParser it will fall back to the classic parser if there is an 
> IOException.
> Many thanks to Hong-Thai for championing the addition of the 
> NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
> PDFBox's NonSequentialPDFParser (PDFBOX-1199)!
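
The symmetric fallback proposed above could be sketched like this. Only the parameter name useParserFallbackOnException comes from the issue text; PdfLoader, loadWithFallback and the String return type are hypothetical simplifications, not the PDFParser API.

```java
import java.io.IOException;

final class ParserFallbackSketch {

    /** Hypothetical abstraction over "classic" and "non-sequential" PDF loading. */
    interface PdfLoader {
        String load(byte[] pdf) throws IOException;
    }

    /**
     * Tries the configured loader first. If it throws an IOException and fallback
     * is enabled, tries the other loader; otherwise the exception propagates.
     */
    static String loadWithFallback(byte[] pdf,
                                   PdfLoader configured,
                                   PdfLoader other,
                                   boolean useParserFallbackOnException) throws IOException {
        try {
            return configured.load(pdf);
        } catch (IOException first) {
            if (!useParserFallbackOnException) {
                throw first;
            }
            try {
                return other.load(pdf);
            } catch (IOException second) {
                // Keep the first failure attached so neither cause is lost
                second.addSuppressed(first);
                throw second;
            }
        }
    }
}
```

Because the caller decides which loader is "configured" and which is "other", the same method covers both directions described in the issue.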





buildbot failure in ASF Buildbot on tika-trunk

2014-02-04 Thread buildbot
The Buildbot has detected a new failure on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/1151

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: portunus_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1564580
Blamelist: dmeikle

BUILD FAILED: failed compile

sincerely,
 -The Buildbot





buildbot success in ASF Buildbot on tika-trunk

2014-02-04 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/1153

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: portunus_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1564605
Blamelist: dmeikle

Build succeeded!

sincerely,
 -The Buildbot





[VOTE] Apache Tika 1.5 RC1

2014-02-04 Thread David Meikle
Hi Guys,

A candidate for the Tika 1.5 release is now available at:
http://people.apache.org/~dmeikle/tika-1.5-rc1/

The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/

The SHA1 checksum of the archive is:
66adb7e73058da73a055a823bd61af48129c1179

A staged M2 repository can also be found on repository.apache.org here:
https://repository.apache.org/content/repositories/orgapachetika-1000

Please vote on releasing this package as Apache Tika 1.5.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

   [ ] +1 Release this package as Apache Tika 1.5
   [ ] -1 Do not release this package because...

Here is my +1 for the release.

Cheers,
Dave
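
For voters who want to check the SHA1 value quoted in the vote message without external tools, a minimal sketch follows. Sha1Sketch and its methods are hypothetical helper names; the digest itself uses the standard java.security.MessageDigest API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Sketch {

    /** Returns the SHA-1 digest of the data as a lowercase hex string. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                // Mask to 0..255 so negative bytes format as two hex digits
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available in this JVM", e);
        }
    }

    /** Compares the computed digest with an expected hex string, ignoring case and whitespace. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }
}
```

In practice the bytes would come from the downloaded archive, for example via `Files.readAllBytes(Paths.get("tika-1.5-src.zip"))`, and the expected value would be the checksum string from the vote message.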

Unsubscribe?

2014-02-04 Thread A Z
Can someone unsubscribe me, or tell me how to unsubscribe from this
Tika email list?




> Date: Tue, 4 Feb 2014 23:18:18 +
> From: j...@apache.org
> To: dev@tika.apache.org
> Subject: [jira] [Updated] (TIKA-1205) Allow PDFParser to fallback to other 
> parser if there is an exception

Re: [VOTE] Apache Tika 1.5 RC1

2014-02-04 Thread Chris Mattmann
Hi Dave,

You ROCK!

I cannot verify the SIGS though:

[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/stage_apache_rc
tika 1.5-src http://people.apache.org/~dmeikle/tika-1.5-rc1/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 41.3M  100 41.3M    0     0   450k      0  0:01:33  0:01:33 --:--:--  609k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   3854      0 --:--:-- --:--:-- --:--:--  7247
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    33  100    33    0     0    147      0 --:--:-- --:--:-- --:--:--   300
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/verify_gpg_sigs
Verifying Signature for file tika-1.5-src.zip.asc
gpg: Signature made Tue Feb  4 20:34:06 2014 EST using RSA key ID 0EB30B07
gpg: Can't check signature: No public key
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% curl -O
http://people.apache.org/~dmeikle/tika-1.5-rc1/KEYS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9258  100  9258    0     0   3801      0  0:00:02  0:00:02 --:--:--  4638
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% gpg --import < KEYS
gpg: key A355A63E: "Jukka Zitting " not changed
gpg: key B876884A: "Chris Mattmann (CODE SIGNING KEY)
" not changed
gpg: key 9740DD55: "David Meikle (CODE SIGNING KEY) "
not changed
gpg: Total number processed: 3
gpg:  unchanged: 3
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/verify_gpg_sigs
Verifying Signature for file tika-1.5-src.zip.asc
gpg: Signature made Tue Feb  4 20:34:06 2014 EST using RSA key ID 0EB30B07
gpg: Can't check signature: No public key
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann%
$HOME/bin/verify_md5_checksums
md5sum: stat '*.tar.gz': No such file or directory
md5sum: stat '*.bz2': No such file or directory
md5sum: stat '*.tgz': No such file or directory
tika-1.5-src.zip: OK
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann%


Checksums check out.

Can you scope the SIGS problem?

Cheers,
Chris



-Original Message-
From: David Meikle 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, February 4, 2014 5:59 PM
To: "dev@tika.apache.org" 
Cc: "u...@tika.apache.org" 
Subject: [VOTE] Apache Tika 1.5 RC1





Re: [VOTE] Apache Tika 1.5 RC1

2014-02-04 Thread Chris Mattmann
OK, it worked this time after grabbing tika.asc. I used it to validate the GPG
signature and it looks good. We should probably remove the KEYS file and simply
point to [1], though.

Cheers and +1 from me!

Cheers,
Chris

[1] https://people.apache.org/keys/group/tika.asc

[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% curl -O
https://people.apache.org/keys/group/tika.asc
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  133k  100  133k    0     0  55054      0  0:00:02  0:00:02 --:--:-- 58432
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% gpg --import < tika.asc
gpg: key B876884A: "Chris Mattmann (CODE SIGNING KEY)
" not changed
gpg: key A355A63E: "Jukka Zitting " 7 new signatures
gpg: key 8A26D9A6: public key "Jukka Zitting "
imported
gpg: key 42CFAE07: public key "Jukka Zitting (CODE SIGNING KEY)
" imported
gpg: key 0EB30B07: public key "David Meikle (CODE SIGNING KEY)
" imported
gpg: key D84E41AE: public key "Nick Burch " imported
gpg: key E7AC2BA5: public key "Oleg Tikhonov " imported
gpg: key 6E68DA61: public key "Michael McCandless (CODE SIGNING KEY)
" imported
gpg: key 95D21F2E: public key "Ray Gauss II (CODE SIGNING KEY)
" imported
gpg: key DEDEAB92: public key "Sergey Beryozkin (Release Management)
" imported
gpg: key 97EDDE66: public key "tallison (apache_distro_keys)
" imported
gpg: Total number processed: 11
gpg:   imported: 9  (RSA: 6)
gpg:  unchanged: 1
gpg: new signatures: 7
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   3  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 3u
gpg: next trustdb check due at 2015-08-18
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/verify_gpg_sigs
Verifying Signature for file tika-1.5-src.zip.asc
gpg: Signature made Tue Feb  4 20:34:06 2014 EST using RSA key ID 0EB30B07
gpg: Good signature from "David Meikle (CODE SIGNING KEY)
"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the
owner.
Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann%






-Original Message-
From: Chris Mattmann 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, February 4, 2014 10:48 PM
To: "dev@tika.apache.org" 
Cc: "u...@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.5 RC1

>Hi Dave,
>
>You ROCK!
>
>I cannot verify the SIGS though:
>
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/stage_apache_rc tika 1.5-src http://people.apache.org/~dmeikle/tika-1.5-rc1/
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
>100 41.3M  100 41.3M    0     0   450k      0  0:01:33  0:01:33 --:--:--  609k
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
>100   819  100   819    0     0   3854      0 --:--:-- --:--:-- --:--:--  7247
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
>100    33  100    33    0     0    147      0 --:--:-- --:--:-- --:--:--   300
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/verify_gpg_sigs
>Verifying Signature for file tika-1.5-src.zip.asc
>gpg: Signature made Tue Feb  4 20:34:06 2014 EST using RSA key ID 0EB30B07
>gpg: Can't check signature: No public key
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% curl -O http://people.apache.org/~dmeikle/tika-1.5-rc1/KEYS
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
>100  9258  100  9258    0     0   3801      0  0:00:02  0:00:02 --:--:--  4638
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% gpg --import < KEYS
>gpg: key A355A63E: "Jukka Zitting" not changed
>gpg: key B876884A: "Chris Mattmann (CODE SIGNING KEY)" not changed
>gpg: key 9740DD55: "David Meikle (CODE SIGNING KEY)" not changed
>gpg: Total number processed: 3
>gpg:  unchanged: 3
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/verify_gpg_sigs
>Verifying Signature for file tika-1.5-src.zip.asc
>gpg: Signature made Tue Feb  4 20:34:06 2014 EST using RSA key ID 0EB30B07
>gpg: Can't check signature: No public key
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann% $HOME/bin/verify_md5_checksums
>md5sum: stat '*.tar.gz': No such file or directory
>md5sum: stat '*.bz2': No such file or directory
>md5sum: stat '*.tgz': No such file or directory
>tika-1.5-src.zip: OK
>[chipotle:~/tmp/apache-tika-1.5-rc1] mattmann%
>
>
>Checksums check out.
>
>Can you scope the SIGS problem?
>
>Cheers,
>Chris
>
>
>
>-----Original Message-----
>From: David Meikle 
>Reply-To: "dev@tika.apache.org" 
>Da

Re: [VOTE] Apache Tika 1.5 RC1

2014-02-04 Thread Oleg Tikhonov
Hi David,
 [x] +1 Release this package as Apache Tika 1.5

Thanks!
BR,
Oleg


On Wed, Feb 5, 2014 at 3:59 AM, David Meikle wrote:

> Hi Guys,
>
> A candidate for the Tika 1.5 release is now available at:
> http://people.apache.org/~dmeikle/tika-1.5-rc1/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/
>
> The SHA1 checksum of the archive is:
> 66adb7e73058da73a055a823bd61af48129c1179
>
> A staged M2 repository can also be found on repository.apache.org here:
> https://repository.apache.org/content/repositories/orgapachetika-1000
>
> Please vote on releasing this package as Apache Tika 1.5.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>    [ ] +1 Release this package as Apache Tika 1.5
>    [ ] -1 Do not release this package because...
>
> Here is my +1 for the release.
>
> Cheers,
> Dave
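
The vote mail quoted above gives a SHA1 checksum for the source archive. A minimal Python sketch of checking a downloaded copy against it, using only the standard library. The function name is hypothetical, and the file name tika-1.5-src.zip is taken from the terminal sessions earlier in this thread rather than from the vote mail itself.

```python
import hashlib


def sha1_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the SHA-1 digest of the file equals expected_hex.

    The file is read in chunks so a large archive is never held in
    memory all at once. The expected value is compared case-insensitively
    and surrounding whitespace is ignored.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Example call (assumes the archive is in the current directory):
#   sha1_matches("tika-1.5-src.zip",
#                "66adb7e73058da73a055a823bd61af48129c1179")
```

A matching checksum only shows that the download is intact. It says nothing about who produced the archive, which is what the signature check is for.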