Re: Searching for Tika Jira issues using Lucene

2014-03-06 Thread Michael McCandless
Woops, thank you for moving this to the right mailing list Oleg!

Mike McCandless

http://blog.mikemccandless.com


On Thu, Mar 6, 2014 at 12:56 AM, Oleg Tikhonov o...@apache.org wrote:
 Hi Mike!
 Sounds great! Thanks.

 Oleg


 On Wed, Mar 5, 2014 at 6:47 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 Team,

 If you want to search for Tika Jira issues, I just added Tika coverage
 into the Lucene dog food server we use for finding Lucene/Solr
 issues at http://jirasearch.mikemccandless.com.

 I just posted a blog post describing recent changes:


 http://blog.mikemccandless.com/2014/03/using-lucenes-search-server-to-search.html

 Basically I started this as an effort to test Lucene's functionality
 in a real application/server (searching for issues), and to eat our
 own dog food, but then over time I think it's proven quite useful
 and I now use it almost exclusively when I need to find a Lucene issue.

 Compared to Jira's builtin search, it's more full text like; e.g.,
 makes suggestions as you type, produces snippets and highlights, ranks
 by blended relevance+recency, etc.  It has facets so you can quickly
 drill down/sideways by various metadata.  In the results, you can
 click on a snippet to go straight to the specific comment and issue
 that it came from.

 It uses Lucene's near-real-time indexing + searching, so issue updates
 should be visible within ~ 30 seconds or so.

 I hope you find it useful too!

 Mike McCandless

 http://blog.mikemccandless.com



[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Alexandre Madurell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321
 ] 

Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:
---

Hi, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference 
between Bag and Seq. Beats me why Adobe would choose an unordered array over an 
ordered array for the Author field in Acrobat's document properties form. In 
any case, as you mentioned, it makes it necessary to check on both before 
falling back to PDDocumentInformation's getAuthor().
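
A minimal sketch of the "check both wrappers" idea, assuming the XMP packet is already available as a string and using only the JDK's DOM parser. The class name XmpCreators and its method are hypothetical and are not part of Tika, PDFBox or jempbox; it only illustrates reading every rdf:li under dc:creator whether the wrapper is rdf:Bag or rdf:Seq.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not Tika or PDFBox code. */
final class XmpCreators {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns every rdf:li value under dc:creator, for rdf:Seq and rdf:Bag wrappers alike. */
    static List<String> creators(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted XMP cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList creatorNodes = doc.getElementsByTagNameNS(DC_NS, "creator");
            for (int i = 0; i < creatorNodes.getLength(); i++) {
                Element creator = (Element) creatorNodes.item(i);
                // Check the ordered wrapper first, then the unordered one.
                for (String wrapper : new String[] {"Seq", "Bag"}) {
                    NodeList wrappers = creator.getElementsByTagNameNS(RDF_NS, wrapper);
                    for (int w = 0; w < wrappers.getLength(); w++) {
                        NodeList items = ((Element) wrappers.item(w))
                                .getElementsByTagNameNS(RDF_NS, "li");
                        for (int k = 0; k < items.getLength(); k++) {
                            String value = items.item(k).getTextContent().trim();
                            if (!value.isEmpty()) {
                                result.add(value);
                            }
                        }
                    }
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='" + DC_NS + "'>"
                + "<rdf:Description><dc:creator><rdf:Bag>"
                + "<rdf:li>Author 1</rdf:li><rdf:li>Author 2</rdf:li>"
                + "</rdf:Bag></dc:creator></rdf:Description></rdf:RDF>";
        System.out.println(creators(xmp));
    }
}
```

A caller would fall back to PDDocumentInformation's getAuthor() only when the returned list is empty.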

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper 
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata 
insertion so it uses the rdf:Seq, and will re-process the entire collection 
(I will probably add PDFBox to the next implementation of our automated 
metadata insertion workflow, thanks again for the tip!).

Have a great one!


was (Author: alexandre.madur...@gmail.com):
Hi again, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference 
between Bag and Seq. Beats me why Adobe would choose an unordered array over an 
ordered array for the Author field in Acrobat's document properties form. In 
any case, as you mentioned, it makes it necessary to check on both before 
falling back to PDDocumentInformation's getAuthor().

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper 
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata 
insertion so it uses the rdf:Seq, and will re-process the entire collection 
(I will probably add PDFBox to the next implementation of our automated 
metadata insertion workflow, thanks again for the tip!).

Have a great one!

 Tika is not indexing all authors of a PDF
 -

 Key: TIKA-1252
 URL: https://issues.apache.org/jira/browse/TIKA-1252
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.4
 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
 Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
 Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
 Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
 XMP-Import-with-Seq.jpg


 When submitting a PDF with this information in its XMP metadata:
 ...
   <dc:creator>
     <rdf:Bag>
       <rdf:li>Author 1</rdf:li>
       <rdf:li>Author 2</rdf:li>
     </rdf:Bag>
   </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 author:[Author 1],
 author_s:Author 1,
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" 
 multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Using guava on tika ?

2014-03-06 Thread Nick Burch

On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
Guava (https://code.google.com/p/guava-libraries/) provides many 
facilities on text, file, collection ... manipulation. Should we use it in 
Tika?


Can you give an example of where using Guava would either simplify some 
existing code, or improve its effectiveness, or permit something we 
couldn't otherwise do?


Nick


[jira] [Created] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1257:
--

 Summary: MS Word Filter out control characters on ouput
 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6


Control characters appear mostly in the table of index and cannot be displayed. We 
should filter them out.
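
A minimal sketch of this kind of filtering, using only the JDK. The thread does not say which characters the actual fix removes, so the rule here (drop ISO control characters but keep tab, line feed and carriage return) is an assumption, and ControlCharFilter is a hypothetical name rather than Tika code.

```java
/** Hypothetical helper for illustration only; not the change committed to Tika. */
final class ControlCharFilter {

    /** Removes ISO control characters, keeping tab, line feed and carriage return. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isWhitespaceControl = c == '\t' || c == '\n' || c == '\r';
            if (isWhitespaceControl || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The \u0013, \u0014 and \u0015 characters stand in for control characters in extracted text.
        System.out.println(strip("Chapter 1\u0013 PAGEREF x \u0014 3\u0015"));
    }
}
```

Keeping the whitespace controls avoids collapsing line structure in the extracted text; a stricter or looser rule may be more appropriate for real documents.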





[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: tika-doc-control-char.png
5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
 tika-doc-control-char.png


 Control characters appear mostly in the table of index and cannot be displayed. We 
 should filter them out.





[jira] [Resolved] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1257.


Resolution: Fixed

Fixed on r1574874

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
 tika-doc-control-char.png


 Control characters appear mostly in the table of index and cannot be displayed. We 
 should filter them out.





buildbot failure in ASF Buildbot on tika-trunk

2014-03-06 Thread buildbot
The Buildbot has detected a new failure on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/1169

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: portunus_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1574874
Blamelist: thaichat04

BUILD FAILED: failed compile

sincerely,
 -The Buildbot





RE: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Hong-Thai Nguyen
Hi,
Can anyone create the branch remotes/origin/1.5 on git?

Thanks

Hong-Thai


-Original Message-
From: David Meikle [mailto:loo...@gmail.com] On Behalf Of David Meikle
Sent: Wednesday, February 19, 2014 23:19
To: annou...@apache.org
Cc: dev@tika.apache.org; u...@tika.apache.org
Subject: [ANNOUNCE] Apache Tika 1.5 Released

The Apache Tika project is pleased to announce the release of Apache Tika 1.5. 
The release contents have been pushed out to the main Apache release site and 
to the Maven Central sync, so the releases should be available as soon as the 
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured 
text content from various documents using existing parser libraries.

Apache Tika 1.5 contains a number of improvements and bug fixes. Details can be 
found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.5.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.5-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from the 
Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When 
downloading from a mirror site, please remember to verify the downloads using 
signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Dave Meikle, on behalf of the Apache Tika community



buildbot success in ASF Buildbot on tika-trunk

2014-03-06 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/1170

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: portunus_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1574877
Blamelist: thaichat04

Build succeeded!

sincerely,
 -The Buildbot





[jira] [Comment Edited] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922490#comment-13922490
 ] 

Hong-Thai Nguyen edited comment on TIKA-1257 at 3/6/14 1:50 PM:


Fixed in r1574874 and r1574877


was (Author: thaichat04):
Fixed on r1574874

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
 tika-doc-control-char.png


 Control characters appear mostly in the table of index and cannot be displayed. We 
 should filter them out.





Re: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Jukka Zitting
Hi,

On Thu, Mar 6, 2014 at 8:27 AM, Hong-Thai Nguyen
hong-thai.ngu...@polyspot.com wrote:
 Can anyone create the branch remotes/origin/1.5 on git?

Do we need a 1.5 branch?

BR,

Jukka Zitting


RE: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Hong-Thai Nguyen
I guess that users could maintain hotfixes based on a release branch while 
waiting for the next release.  We already have branches for old releases:
hong-thai.nguyen@HTN-PC /c/git/tika (trunk)
$ git branch -a
* trunk
  remotes/origin/0.1-incubating
  remotes/origin/0.10
  remotes/origin/0.2
  remotes/origin/0.3
  remotes/origin/0.4-rc1
  remotes/origin/0.4-rc2
  remotes/origin/0.5
  remotes/origin/0.6
  remotes/origin/0.7
  remotes/origin/0.8
  remotes/origin/0.9
  remotes/origin/0.x
  remotes/origin/1.2
  remotes/origin/1.3
  remotes/origin/1.4
  remotes/origin/HEAD -> origin/trunk
  remotes/origin/TIKA-204
  remotes/origin/trunk

Hong-Thai


-Original Message-
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Thursday, March 6, 2014 15:48
To: Tika Development
Subject: Re: [ANNOUNCE] Apache Tika 1.5 Released

Hi,

On Thu, Mar 6, 2014 at 8:27 AM, Hong-Thai Nguyen 
hong-thai.ngu...@polyspot.com wrote:
 Can anyone create the branch remotes/origin/1.5 on git?

Do we need a 1.5 branch?

BR,

Jukka Zitting


Re: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Jukka Zitting
Hi,

On Thu, Mar 6, 2014 at 10:14 AM, Hong-Thai Nguyen
hong-thai.ngu...@polyspot.com wrote:
 I guess that users could maintain hotfixes based on a release branch while 
 waiting for the next release.

Right, at least there's no harm in having the branch, so I just
created it in revision 1574919.

BR,

Jukka Zitting


Re: Using guava on tika ?

2014-03-06 Thread Konstantin Gribov
If you do bring it in as a dependency, use guava 16 rather than guava 15.
Guava 15 breaks CDI in major app servers (JBoss AS 7, GlassFish 3, WebSphere)
because of an incorrect beans.xml.

See https://issues.jboss.org/browse/WELD-1007 and
https://code.google.com/p/guava-libraries/issues/detail?id=1527.

-- 
Best regards,
Konstantin Gribov.


2014-03-06 15:54 GMT+04:00 Nick Burch apa...@gagravarr.org:

 On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:

 Guava (https://code.google.com/p/guava-libraries/) provides many
 facilities on text, file, collection ... manipulation. Should we use it in
 Tika?


 Can you give an example of where using Guava would either simplify some
 existing code, or improve its effectiveness, or permit something we
 couldn't otherwise do?

 Nick



Re: Using guava on tika ?

2014-03-06 Thread Jukka Zitting
Hi,

On Thu, Mar 6, 2014 at 6:54 AM, Nick Burch apa...@gagravarr.org wrote:
 On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
 Guava (https://code.google.com/p/guava-libraries/) provides many
 facilities on text, file, collection ... manipulation. Should we use it in Tika?

 Can you give an example of where using Guava would either simplify some
 existing code, or improve its effectiveness, or permit something we couldn't
 otherwise do?

Also, especially in tika-core we've explicitly avoided any external
dependencies to keep it as simple and easy as possible to include as a
dependency in client applications. We've even gone as far as including
copies of some Commons IO classes in org.apache.tika.io instead of
referring to commons-io as a dependency.
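
As an illustration of what keeping a small utility in-house can look like, here is a minimal stdlib-only sketch of a stream-to-string helper, the kind of thing otherwise pulled in from Guava or Commons IO. StreamText is a hypothetical name; this is not one of the classes actually copied into org.apache.tika.io.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; no external dependencies. */
final class StreamText {

    /** Reads the whole stream and decodes it with the given charset. */
    static String readFully(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFully(in, StandardCharsets.UTF_8));
    }
}
```

The trade-off is a few lines of duplicated code in exchange for one fewer transitive dependency in client applications.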

BR,

Jukka Zitting


RE: Using guava on tika ?

2014-03-06 Thread Hong-Thai Nguyen
Thanks for the feedback.
There is nothing we can't do with our own code :) Guava just provides 'facilities' that make code 
clearer, shorter and sometimes faster.
I agree that this integration brings more dependencies and may create conflicts in 
end users' applications. Let's leave it as is for now.

Cheers,

Hong-Thai

-Original Message-
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Thursday, March 6, 2014 16:47
To: Tika Development
Subject: Re: Using guava on tika ?

Hi,

On Thu, Mar 6, 2014 at 6:54 AM, Nick Burch apa...@gagravarr.org wrote:
 On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
 Guava (https://code.google.com/p/guava-libraries/) provides many 
 facilities on text, file, collection ... manipulation. Should we use it in Tika?

 Can you give an example of where using Guava would either simplify 
 some existing code, or improve its effectiveness, or permit something 
 we couldn't otherwise do?

Also, especially in tika-core we've explicitly avoided any external 
dependencies to keep it as simple and easy as possible to include as a 
dependency in client applications. We've even gone as far as including copies 
of some Commons IO classes in org.apache.tika.io instead of referring to 
commons-io as a dependency.

BR,

Jukka Zitting


[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: (was: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc)

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: tika-doc-control-char.png


 Control characters appear mostly in the table of index and cannot be displayed. We 
 should filter them out.





[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: testControlCharacters.doc

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: testControlCharacters.doc, tika-doc-control-char.png


 Control characters appear mostly in the table of index and cannot be displayed. We 
 should filter them out.





[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922736#comment-13922736
 ] 

Tim Allison commented on TIKA-1232:
---

Fixed r1574959.  Reopen if any tweaks remain to be made.  Thank you, all, for 
your contributions!

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch


 I'd like to identify the PDF version of files; this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.
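
A minimal sketch of where the version number comes from, assuming only that a PDF starts with a header of the form %PDF-1.x. This deliberately does not use PDFBox, which is what the attached patch relies on; it reads the header bytes directly. PdfHeaderVersion is a hypothetical name. Note that the document catalog can carry a Version entry that overrides the header, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not the TIKA-1232 patch. */
final class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns e.g. "1.4" from the first bytes of a file, or null if no PDF header is found. */
    static String fromHeader(byte[] firstBytes) {
        // Only the start of the file is inspected; ISO-8859-1 maps bytes to chars one-to-one.
        int length = Math.min(firstBytes.length, 1024);
        String head = new String(firstBytes, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = HEADER.matcher(head);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(fromHeader(sample));
    }
}
```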





[jira] [Resolved] (TIKA-1232) Add PDF version to PDFParser output

2014-03-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1232.
---

Resolution: Fixed

r1574959

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch


 I'd like to identify the PDF version of files; this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.





RE: Tika 1.5 vs 1.4 testing

2014-03-06 Thread Allison, Timothy B.
Hong-Thai,
  Thank you for running these tests.  I suspect (mea culpa) that the increase 
in PDF runtime exception failures was caused by PDFBOX-1803/TIKA-1233, which 
was not fixed before 1.5 was cut.
  I recently made major modifications to the metadata extraction components of 
the PDFParser (TIKA-1232 and TIKA-1252).  If you have time, would you mind 
rerunning these tests with trunk on your test corpus?  I'd be interested to see 
if the temporary fix to TIKA-1233 lowers the number of PDF runtime exception 
failures, and I'd be very interested to see if there are any surprises caused 
by 1232 and 1252.
  Thank you!

 Best,

   Tim


-Original Message-
From: Hong-Thai Nguyen [mailto:hong-thai.ngu...@polyspot.com] 
Sent: Monday, March 03, 2014 8:19 AM
To: dev@tika.apache.org
Subject: Tika 1.5 vs 1.4 testing

Hi all,

I've checked on the same corpus. Here's the comparison:
||Tika||POI||PDFbox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|

== TIKA 1.4 
- pdf (7)
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (8)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4db190a5
- doc (2)
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- ppt (40)
   * (39) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- xls (9)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- dwg (4)
   * (4) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: 
AC1014
- odp (2)
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@7286f080
- rtf (13)
   * (13) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@455a7af4
- pps (5)
   * (5) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2

== TIKA 1.5 
- pdf (16)
   * (10) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (19)
   * (7) 

[jira] [Resolved] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1252.
---

Resolution: Fixed

Fixed as of r1574964.  Thank you, Alexandre, for raising this issue and for 
supplying test files for this and TIKA-1232!

 Tika is not indexing all authors of a PDF
 -

 Key: TIKA-1252
 URL: https://issues.apache.org/jira/browse/TIKA-1252
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.4
 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
 Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
 Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
 Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
 XMP-Import-with-Seq.jpg


 When submitting a PDF with this information in its XMP metadata:
 ...
   <dc:creator>
     <rdf:Bag>
       <rdf:li>Author 1</rdf:li>
       <rdf:li>Author 2</rdf:li>
     </rdf:Bag>
   </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 author:[Author 1],
 author_s:Author 1,
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" 
 multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 





[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-06 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922890#comment-13922890
 ] 

Luis Filipe Nassif commented on TIKA-623:
-

Good job. I think a possible improvement would be to generate an HTML page for each 
email, containing its metadata and content, and call the embeddedExtractor to 
process the generated html, instead of printing all emails directly to 
xhtmlContentHandler.  So, in addition to attachments, emails could also be 
extracted from PST files if that is the goal of the application. What do you 
think?
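
A minimal sketch of the first half of this suggestion, building one HTML document per email from its metadata and body, using only the JDK. EmailHtml and its methods are hypothetical names, and the choice of meta tags plus a pre block is an assumption; the hand-off of the generated HTML to Tika's embedded document extractor is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the attached OutlookPSTParser. */
final class EmailHtml {

    /** Escapes the characters that are significant in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders metadata as meta tags in the head and the message body as preformatted text. */
    static String render(Map<String, String> metadata, String body) {
        StringBuilder html = new StringBuilder("<html><head>");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            html.append("<meta name=\"").append(escape(entry.getKey()))
                .append("\" content=\"").append(escape(entry.getValue())).append("\"/>");
        }
        html.append("</head><body><pre>").append(escape(body)).append("</pre></body></html>");
        return html.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("subject", "Quarterly report");
        metadata.put("from", "someone@example.com");
        System.out.println(render(metadata, "See attachment."));
    }
}
```

Producing one document per message would let an application treat each email as its own embedded item, alongside its attachments.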

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
Assignee: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang





[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922921#comment-13922921
 ] 

Tim Allison commented on TIKA-623:
--

Agreed.  Is there any way to reuse OutlookParser or to refactor so that we're 
using the same lib for an email, whether .pst or .msg?  There are lots of 
lessons learned embedded in the OutlookParser.  I'll be happy to chip in as I 
can.  [~thaichat04], thank you for getting this rolling!

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
Assignee: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang





Unconsistent logging in current tika (1.5)

2014-03-06 Thread Konstantin Gribov
Hi, folks.

Tika-core is quite pure (uses only java.util.logging) but tika-parsers uses
commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through netcdf)
and log4j 1.2.14 (through slf4j-log4j as test scope dependency). Also some
parsers (like pdfbox) log just to stdout/stderr.

It's confusing.

Tika-core uses only JUL.
Tika-parsers uses JCL and log4j (in tests) and depends on slf4j-api.
Tika-app uses JCL, configures log4j at runtime (to change the verbosity level)
and depends on slf4j-log4j12.
Tika-server uses only JCL but depends on slf4j-api 1.7.5 (through cxf).
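
For reference, a minimal sketch of JUL-only logging with a verbosity change at runtime, which needs no dependency beyond the JDK. The logger name and the JulLevelDemo class are hypothetical and are not taken from tika-core.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical demo for illustration only; uses java.util.logging and nothing else. */
final class JulLevelDemo {
    // Hypothetical logger name; held in a static field so the configured level is retained.
    static final Logger LOG = Logger.getLogger("example.tika.parser");

    /** Switches the logger between INFO and FINE at runtime. */
    static void setVerbose(boolean verbose) {
        LOG.setLevel(verbose ? Level.FINE : Level.INFO);
    }

    public static void main(String[] args) {
        setVerbose(false);
        System.out.println("FINE loggable: " + LOG.isLoggable(Level.FINE));
        setVerbose(true);
        System.out.println("FINE loggable: " + LOG.isLoggable(Level.FINE));
        // Handlers have their own levels, so the default console handler
        // may still drop FINE records unless it is reconfigured as well.
        LOG.fine("verbose message");
    }
}
```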

What do you think about changing all the logging to use slf4j directly and
excluding JCL from the dependencies altogether?

The first group of options is about whether to add slf4j-api to the tika-core
dependencies. If it's added, we won't use JUL. If it isn't added, jul-to-slf4j
can be added to the tika-parsers deps.

The second group of options relates to commons-logging. We can:
- exclude it and force the developer to add either jcl-over-slf4j or
commons-logging as a dependency,
- exclude it and add jcl-over-slf4j as a dependency, so anyone who uses JCL will
be forced to exclude jcl-over-slf4j,
- leave it and force one to either use slf4j-jcl + commons-logging or
exclude commons-logging and include jcl-over-slf4j.

I think the second way is preferable because the developer can use any slf4j
backend and will be forced to do something only when he/she is using JCL.

The third group of options is about the backend for slf4j. We can use log4j or
logback. I prefer logback-classic but we can use either of them. Both
support changing the log level at runtime.

I can refactor the Tika codebase to log in a consistent manner and then create
a pull request on GitHub or a Jira ticket with a patch, if my proposal
on this issue is accepted.

By the way, I think we should also update edu.ucar:netcdf to 4.2.20, which
depends on the newer slf4j-api 1.6.1.

-- 
Best regards,
Konstantin Gribov.


[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923019#comment-13923019
 ] 

Tim Allison commented on TIKA-1252:
---

[~alexandre.madur...@gmail.com], before opening an issue in PDFBOX on the seq 
vs bag, let me see if that issue disappears if we move to xmpbox from jempbox.  
I've only had a chance to look at the source, but I think that will prevent us 
from having to reinvent the fix.

 Tika is not indexing all authors of a PDF
 -

 Key: TIKA-1252
 URL: https://issues.apache.org/jira/browse/TIKA-1252
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.4
 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
 Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
 Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
 Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
 XMP-Import-with-Seq.jpg


 When submitting a PDF with this information in its XMP metadata:
 ...
   <dc:creator>
     <rdf:Bag>
       <rdf:li>Author 1</rdf:li>
       <rdf:li>Author 2</rdf:li>
     </rdf:Bag>
   </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 author:[Author 1],
 author_s:Author 1,
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" 
 multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 





[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923399#comment-13923399
 ] 

Tim Allison commented on TIKA-1252:
---

Not immediately obvious to me how to use xmpbox with a regular PDDocument not 
generated by preflight's parser.  Will stick with our jempbox work around for 
now.

 Tika is not indexing all authors of a PDF
 -

 Key: TIKA-1252
 URL: https://issues.apache.org/jira/browse/TIKA-1252
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.4
 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
 Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
 Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
 Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
 XMP-Import-with-Seq.jpg


 When submitting a PDF with this information in its XMP metadata:
 ...
   <dc:creator>
     <rdf:Bag>
       <rdf:li>Author 1</rdf:li>
       <rdf:li>Author 2</rdf:li>
     </rdf:Bag>
   </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 author:[Author 1],
 author_s:Author 1,
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" 
 multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 





Re: Unconsistent logging in current tika (1.5)

2014-03-06 Thread Nick Burch

On Fri, 7 Mar 2014, Konstantin Gribov wrote:
Tika-core is quite pure (uses only java.util.logging) but tika-parsers 
uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through 
netcdf) and log4j 1.2.14 (through slf4j-log4j as test scope dependency). 
Also some parsers (like pdfbox) log just to stdout/stderr.


I think part of the issue is that many of the libraries that Tika depends 
on have their own chosen logging library / setup. IIRC, the Tika parsers 
often log in a similar manner to the underlying library they use.


That's not to say that we can't tidy things up a bit, but it does restrict 
how much we can do where log messages come from underlying libraries.



It's confusing.

Tika-core uses only JUL.


Tika-Core ideally shouldn't have any external dependencies, so I'm not sure 
what else it can use while maintaining that?



Tika-parsers uses JCL and log4j (in tests) and depends on slf4j-api.
Tika-app uses JCL, configures log4j at runtime (to change the verbosity level)
and depends on slf4j-log4j12.
Tika-server uses only JCL but depends on slf4j-api 1.7.5 (through cxf).


Potentially some of these could be rationalised, though maybe the best we 
can hope for is to ensure they only use whatever their underlying 
dependencies use.


By the way, I think we also should update edu.ucar:netcdf to 4.2.20 that 
depends on newer slf4j-api 1.6.1.


Can you open a jira for that upgrade? If you can also try it locally, and 
report on the jira if all the unit tests still pass, that'd be a help!


Thanks
Nick