[jira] [Commented] (TIKA-1436) improvement to PDFParser

2016-03-28 Thread Stefano Fornari (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214082#comment-15214082 ] Stefano Fornari commented on TIKA-1436: --- sorry, it took much more than I expected...

[jira] [Comment Edited] (TIKA-1436) improvement to PDFParser

2016-03-28 Thread Stefano Fornari (JIRA)
8 AM: - patch 0001-Improvment-as-described-in-https-issues.apache.org-j.patch was (Author: stefanofornari): see comment on 20160328 > improvement to PDFParser > > > Key: TIKA-1436 > URL: https://issues.apache.org/ji

[jira] [Updated] (TIKA-1436) improvement to PDFParser

2016-03-28 Thread Stefano Fornari (JIRA)
20160328 > improvement to PDFParser > > > Key: TIKA-1436 > URL: https://issues.apache.org/jira/browse/TIKA-1436 > Project: Tika > Issue Type: Improvement > Components: parser

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214107#comment-15214107 ] Tim Allison commented on TIKA-1285: --- Y, that's what I was thinking about doing with shadi

[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214111#comment-15214111 ] Tim Allison commented on TIKA-1285: --- As I mentioned on the pdfbox dev list, I'm hesitant

[jira] [Comment Edited] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214111#comment-15214111 ] Tim Allison edited comment on TIKA-1285 at 3/28/16 11:31 AM: - A

[jira] [Commented] (TIKA-1566) Try to migrate current Tika code around PDFBox 1.8.x from JempBox to XMPBox

2016-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214112#comment-15214112 ] Tim Allison commented on TIKA-1566: --- For the sake of posterity, see this [comment|https:

[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214115#comment-15214115 ] Tim Allison commented on TIKA-1910: --- bq. Using the proxies we can make those dependencies

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Nick Burch
On Sun, 27 Mar 2016, Bob Paulin wrote: Yes I think overall if these functions can live in somewhere either inside tika or a smaller dependent library we're in a better place. I'll take a look at Ogg-Vorbis. The two util classes there, that spring to mind, are: https://github.com/Gagravarr/Vorb

[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-28 Thread Bob Paulin (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214183#comment-15214183 ] Bob Paulin commented on TIKA-1910: -- Does this mean that if someone doesn't include th

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Nick Burch
On Sun, 27 Mar 2016, Bob Paulin wrote: Tika's IOUtils appears to be missing the readFully method. Should that be added? There was discussion about getting rid of the Tika IOUtils method in favour of depending on commons-io. If that method is on commons-io, then we could use that without need

[jira] [Commented] (TIKA-1908) --list-met-models does not display Dublin core along with other metadata models

2016-03-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214214#comment-15214214 ] Nick Burch commented on TIKA-1908: -- Namespace'd properties, eg https://tika.apache.org/1.

RE: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Ken Krugler
Hi Bob, > From: Nick Burch > Sent: March 28, 2016 6:49:09am PDT > To: dev@tika.apache.org > Subject: Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils > > On Sun, 27 Mar 2016, Bob Paulin wrote: >> Tika's IOUtils appears to be missing the readFully method. Should that be >> added? > >

[jira] [Assigned] (TIKA-1911) OpenNLP based SentimentAnalysisParser

2016-03-28 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-1911: --- Assignee: Chris A. Mattmann > OpenNLP based SentimentAnalysisParser >

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Bob Paulin
Ken, Thank you for reminding me of this issue. Seems we had come to the agreement to use commons-io in a later version. Doing this in tika-core would make it a transitive dependency to all the 2.0 parsers which again would just leave the string utils and LittleEndian code to port over to a libra

[jira] [Created] (TIKA-1912) Figure out how to parse truncated PDFs that were handled by PDFBox 1.8.x but not by 2.0.0

2016-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1912: - Summary: Figure out how to parse truncated PDFs that were handled by PDFBox 1.8.x but not by 2.0.0 Key: TIKA-1912 URL: https://issues.apache.org/jira/browse/TIKA-1912 Proje

Re: GSOC2016 Sentiment Analysis

2016-03-28 Thread Mattmann, Chris A (3980)
Dear Anthony, Great! These both sound like fantastic proposals and I’m happy to be a mentor. Madhawa, would you like to join in on these efforts? Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science

Re: GSOC2016 Sentiment Analysis

2016-03-28 Thread Madhawa Kasun Gunasekara
Hi Chris / Antony, yes, I would like to work on this. This proposal addresses most aspects of sentiment analysis. AFAIK most people use the OpenNLP Document Categorizer for sentiment analysis, since there isn't proper functionality for sentiment analysis in OpenNLP. This would be great if

[jira] [Assigned] (TIKA-1897) Too many daemon threads when NamedEntityParser is enabled

2016-03-28 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-1897: --- Assignee: Chris A. Mattmann > Too many daemon threads when NamedEntityParser is enable

Re: GSOC2016 Sentiment Analysis

2016-03-28 Thread Mondher Bouazizi
Dear Madhawa, Thank you for your interest in the proposals. The current tasks we proposed refer to the classification and quantification regardless of the topic. This can be used in a larger context where the topic is not specified, or not unique, in which case we will need to identify the topic(s