Re: [VOTE] Moving SCM to Git

2016-01-13 Thread Julien Nioche
+1

On 2 January 2016 at 04:30, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Everyone,
>
> DISCUSS thread here: http://s.apache.org/wVE
>
> Time to officially VOTE on moving Tika to Git. I’ve made a wiki
> page for our SCM explaining how to use Git at Apache, and how to
> use it with Github, and how to use it even in a traditional SVN
> sense. The page is here:
>
> https://wiki.apache.org/tika/UsingGit
>
>
> I’ve also linked it from the main wiki page. I took the liberty
> of updating the only other 2 pages on the wiki that referenced
> SCM with (pending) Git instructions as well:
>
> https://wiki.apache.org/tika/DeveloperResources
> https://wiki.apache.org/tika/ReleaseProcess
>
> From the DISCUSS thread it would seem the following members of
> the community support this move:
>
> Chris Mattmann
> Tyler Palsulich
> Bob Paulin
> Hong-Thai Nguyen
>
> Oleg Tikhonov
> David Meikle
>
>
> Given the above I’m going to count the above people as +1 in
> this VOTE if I don’t hear otherwise.
>
> Nick Burch said he would be more supportive if there was a guide,
> so I made one and updated the other wiki docs as above so hopefully
> that garners his VOTE.
>
> If you’d like to revise your VOTE or to VOTE for the first time,
> please use the ballot below:
>
> [ ] +1 Move the Apache Tika source control to Writeable Git repos
> at the ASF
> [ ] +0 Indifferent.
> [ ] -1 Don’t move the Apache Tika source control to Writeable Git
> repos at the ASF because..
>
> Of course, given the conversation I am +1 for this.
>
> Thanks for VOTE’ing. I’ll leave the VOTE open through next Friday.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Tika questions on StackOverflow

2016-01-13 Thread Nick Burch

Hi All

This may be old news for some of you, in which case you can skip this 
email, but for others... StackOverflow is a programming-focused question 
and answer site, with excellent Google-fu and wide, growing use. At the 
moment I'd say there's something like one new Tika question a day on it, 
and that number seems to be climbing. (It's quite bursty though: two one 
day, then nothing for the next few.)


Increasingly, new users seem to be turning to StackOverflow to get help 
with projects, learn how to use them, etc., instead of joining a mailing 
list and asking a question. There are also a lot of people out there who 
know about Tika and aren't on our lists, but are posting helpful replies 
(answers) to questions on how to use Tika.


(There's also a fair number of useless people asking very basic questions, 
without full information, and without having done any research, checked 
existing questions, checked our site, etc. They tend to get moderated 
down pretty quickly though, or they learn and edit the question.)


Because StackOverflow gets a lot of newbie traffic, they have some rules, 
and can be quite strict about enforcing them: a lot stricter than many of 
the other StackExchange network sites, largely because of that traffic. 
That means you will find some restrictions at the start, but they go away 
soon. You do need to be careful to actually answer questions with an 
answer; asking for clarification or saying "can't help, ask on the list" 
as an answer won't go down well.



If you're interested to see what sort of questions there are, see
http://stackoverflow.com/questions/tagged/apache-tika?sort=newest=50
for what has been asked recently, and
http://stackoverflow.com/questions/tagged/apache-tika?sort=votes=50
for the most "popular"


There are a few of us on StackOverflow already, but you might want to join 
in too. You certainly don't have to! But you might want to, not only to 
help, but also to get bug reports, find out what docs we need to update, 
and maybe even spot people answering who we can ask to join the project.


If you sign up for an account, you can get emails when people ask 
Tika-related questions, so you know to go and look if they interest you. 
To do that, go to

http://stackexchange.com/filters/212512/apache-tika-questions
On the right it should have an "Email Updates" box, where you can 
subscribe to get emailed about new questions at a frequency of your choice.



If you have questions about using StackOverflow, I'm happy to do my best to 
explain. They have pretty good help/documentation, and they have the 
"meta" site for checking policies, the reasoning behind them, etc.


You will face some restrictions as a new user, but they go away once 
your answers get a few up-votes. Let us know your username if you sign up 
and answer something; then the few of us who already use StackOverflow can 
up-vote you to the minimum rep score needed to escape them!


Nick


Re: [VOTE] Moving SCM to Git

2016-01-13 Thread Konstantin Gribov
Hi.

[x] +1 Move the Apache Tika source control to Writeable Git repos
at the ASF
[ ] +0 Indifferent.
[ ] -1 Don’t move the Apache Tika source control to Writeable Git
repos at the ASF because..

For me git is more convenient (and I actually use git-svn for svn repos).

On Mon, 11 Jan 2016 at 19:32, Nick Burch wrote:

> On 02/01/16 04:30, Mattmann, Chris A (3980) wrote:
> > Hi Everyone,
> >
> > DISCUSS thread here: http://s.apache.org/wVE
> >
> > Time to officially VOTE on moving Tika to Git. I’ve made a wiki
> > page for our SCM explaining how to use Git at Apache, and how to
> > use it with Github, and how to use it even in a traditional SVN
> > sense. The page is here:
> >
> > https://wiki.apache.org/tika/UsingGit
> >
> > https://wiki.apache.org/tika/DeveloperResources
> > https://wiki.apache.org/tika/ReleaseProcess
>
> Thanks for all those docs! Looks fine to me, at first glance, and we can
> fix anything else as we go along :)
>
> > If you’d like to revise your VOTE or to VOTE for the first time,
> > please use the ballot below:
> >
> > [ ] +1 Move the Apache Tika source control to Writeable Git repos
> > at the ASF
>
> +1 from me now
>
> Nick
>
-- 
Best regards,
Konstantin Gribov


[jira] [Updated] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1830:
--
Issue Type: Improvement  (was: Bug)

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1830:
--
Priority: Minor  (was: Major)

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Updated] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1830:
--
Attachment: reports_pdfbox_1_8_11-rc1.zip

Reports on 1.8.11-rc1

Caveats: 
# I haven't reviewed these results much at all; there may be errors in the 
report generation.
# I haven't yet removed all truncated files from our Common Crawl slice.
# Because of static font caching, there may be ghosts of multithreading.

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096866#comment-15096866
 ] 

Tilman Hausherr commented on TIKA-1830:
---

I can't reproduce the difference for the file 074531.pdf. ExtractText returns 
identical results, which makes me doubt the entire test :-(

I can reproduce the difference for 290377.pdf, this is because of a change in 
decompression (rev 1709182) that tries to squeeze as much as possible from 
corrupt streams.

There may be some differences due to a bugfix related to "article beads". This 
will mean improved results for files with correct beads, but worse results for 
files where bead rectangles are incorrect.

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Commented] (TIKA-1436) improvement to PDFParser

2016-01-13 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097245#comment-15097245
 ] 

Stefano Fornari commented on TIKA-1436:
---

Thanks for the feedback, Tim.
I'll work from the trunk code and produce a new patch in the next few days. I 
will address your question too.

> improvement to PDFParser
> 
>
> Key: TIKA-1436
> URL: https://issues.apache.org/jira/browse/TIKA-1436
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Stefano Fornari
>  Labels: parser, pdf
> Fix For: 1.12
>
> Attachments: ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters" 
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika 
> 1.6 there has been some work on better handling of the 
> WriteLimitReachedException condition, but I believe it could be improved 
> further.





[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096229#comment-15096229
 ] 

Tim Allison edited comment on TIKA-1830 at 1/13/16 3:40 PM:


Reports on 1.8.11-rc1

Caveats: 
# I haven't reviewed these results much at all; there may be errors in the 
report generation.
# I haven't yet removed all truncated files from our Common Crawl slice.
# Because of static font caching, there may be ghosts of multithreading.
# I haven't properly formatted numbers in the xlsx report files; they're 
currently strings.  I'll add that to my to-do list on TIKA-1332.


was (Author: talli...@mitre.org):
Reports on 1.8.11-rc1

Caveats: 
# I haven't reviewed these results much at all; there may be errors in the 
report generation.
# I haven't yet removed all truncated files from our Common Crawl slice.
# Because of static font caching, there may be ghosts of multithreading.

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2016-01-13 Thread frank (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frank updated TIKA-1829:

Attachment: TesseractOCRParser.java

Patch File Updated

> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  NPE 
> 
>
> Key: TIKA-1829
> URL: https://issues.apache.org/jira/browse/TIKA-1829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: OSX 10.11
>Reporter: frank
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.12
>
> Attachments: TesseractOCRParser.java
>
>
> Just need to add a null check on the context parameter.
> 2016-01-11 12:36:52.328 [http-nio-8080-exec-9] WARN  
> o.a.j.core.query.lucene.NodeIndexer - Exception while indexing binary property
> java.lang.NullPointerException: null
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  ~[tika-parsers-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:87) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.isSupportedMediaType(NodeIndexer.java:934)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:448)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:338)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:270)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1246)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.mergeAggregatedNodeIndexes(SearchIndex.java:1539)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1247)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.updateNodes(SearchIndex.java:667)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.SearchManager.onEvent(SearchManager.java:408) 
> [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventConsumer.consumeEvents(EventConsumer.java:249)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.ObservationDispatcher.dispatchEvents(ObservationDispatcher.java:225)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventStateCollection.dispatch(EventStateCollection.java:475)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager$Update.end(SharedItemStateManager.java:856)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager.update(SharedItemStateManager.java:1537)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:400)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.XAItemStateManager.update(XAItemStateManager.java:354)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:375)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase$WriteOperation.save(VersionManagerImplBase.java:470)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase.checkoutCheckin(VersionManagerImplBase.java:215)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.VersionManagerImpl.access$400(VersionManagerImpl.java:73)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> 
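
A minimal sketch of the kind of guard the report above asks for, assuming the NPE comes from dereferencing a null context inside getSupportedTypes. The Context and OcrConfig classes here are hypothetical stand-ins for illustration only, not Tika's real ParseContext or TesseractOCRConfig, and the attached patch may well take a different approach.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class NullSafeSupportedTypes {

    /** Hypothetical stand-in for a parse context: a typed key/value bag. */
    public static final class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        public <T> T get(Class<T> key, T defaultValue) {
            Object value = values.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical stand-in for the OCR configuration. */
    public static final class OcrConfig {
        final boolean ocrAvailable;

        public OcrConfig(boolean ocrAvailable) {
            this.ocrAvailable = ocrAvailable;
        }
    }

    static final OcrConfig DEFAULT_CONFIG = new OcrConfig(false);
    static final Set<String> SUPPORTED_TYPES = Collections.singleton("image/png");

    /** Returns the supported types, tolerating a null context. */
    public static Set<String> getSupportedTypes(Context context) {
        // Guard: fall back to the default config instead of dereferencing null.
        OcrConfig config = (context == null)
                ? DEFAULT_CONFIG
                : context.get(OcrConfig.class, DEFAULT_CONFIG);
        return config.ocrAvailable ? SUPPORTED_TYPES : Collections.<String>emptySet();
    }

    public static void main(String[] args) {
        Context withOcr = new Context();
        withOcr.set(OcrConfig.class, new OcrConfig(true));
        System.out.println(getSupportedTypes(null));    // null context is tolerated
        System.out.println(getSupportedTypes(withOcr)); // configured context is honoured
    }
}
```

The design choice in this sketch is to treat a null context as "use the defaults" rather than throwing, so that callers which pass null when merely listing supported types keep working.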

[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2016-01-13 Thread frank (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frank updated TIKA-1829:

Attachment: (was: TesseractOCRParser.java)

> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  NPE 
> 
>
> Key: TIKA-1829
> URL: https://issues.apache.org/jira/browse/TIKA-1829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: OSX 10.11
>Reporter: frank
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.12
>
>

[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096384#comment-15096384
 ] 

Uwe Schindler commented on TIKA-1830:
-

It would be good to update to 1.8.11 as soon as it is out, because Lucene/Solr 
is affected by PDFBOX-3155: we are testing Java 9 preview builds, and those 
tests failed because of this bug. For now we have disabled the Tika-related 
tests when running with Java 9.

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






Re: Tika questions on StackOverflow

2016-01-13 Thread Mattmann, Chris A (3980)
Great post Nick

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++








RE: Tika questions on StackOverflow

2016-01-13 Thread Allison, Timothy B.
Yes, thank you, Nick!

I've been monitoring the Solr users list when I have time.

Are there other consumer lists we should be following?  Elasticsearch?




[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096506#comment-15096506
 ] 

Tim Allison commented on TIKA-1830:
---

[~thetaphi], good to know.  Thank you!

Speaking of integration with Solr, would you have a chance/any interest in 
offering feedback on our initial restructuring of the parser bundles for Tika 
2.0 (TIKA-1824)?  Or more generally, do you and your Solr colleagues have any 
wishes for the 2.0 [roadmap|https://wiki.apache.org/tika/Tika2_0RoadMap]?

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Updated] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1830:
--
Priority: Major  (was: Minor)

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096663#comment-15096663
 ] 

Uwe Schindler commented on TIKA-1830:
-

bq. Speaking of integration with Solr, would you have a chance/any interest in 
offering feedback on our initial restructuring of the parser bundles for Tika 
2.0 (TIKA-1824)? Or more generally, do you and your Solr colleagues have any 
wishes for the 2.0 roadmap?

As already stated in the past, we would like to bundle only parsers for text 
document formats, because images, class files, and the like are not really 
useful for indexing by default. Users who want them can still add the missing 
parser bundles, and SPI will do the rest. Currently we disable some parsers by 
removing their JAR files (such as asm-all.jar and netcdf.jar), so Tika's SPI 
loading drops them automatically (because of a ClassNotFoundException). This 
is a bit crude, but it works.

The reason for this was partly also version incompatibilities (the ASM in Tika 
was old, while Lucene needs the newest one), but ASM is not really useful for 
indexing anyway!

In Solr we don't use transitive dependencies in Ivy; we decide for each JAR 
file whether it gets bundled, so we check every release during the update 
anyway.
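
The effect described above, where a parser drops out once the JAR it depends on is removed, can be illustrated with a small sketch. This is not Tika's actual service-loading code: the loadAvailable helper and the class name org.example.MissingAsmParser are hypothetical, and java.util.ArrayList merely stands in for a parser that is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TolerantLoader {

    /**
     * Instantiates every named class that can be loaded and linked,
     * skipping those whose classes (or dependencies) are missing.
     */
    public static List<Object> loadAvailable(List<String> classNames) {
        List<Object> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> type = Class.forName(name);
                loaded.add(type.getDeclaredConstructor().newInstance());
            } catch (ClassNotFoundException | NoClassDefFoundError e) {
                // The JAR providing this class, or one of its dependencies, is absent: skip it.
            } catch (ReflectiveOperationException e) {
                // No usable no-arg constructor: also skipped in this sketch.
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Object> parsers = loadAvailable(Arrays.asList(
                "java.util.ArrayList",             // stand-in for a parser that is present
                "org.example.MissingAsmParser"));  // hypothetical class, not on the classpath
        System.out.println(parsers.size() + " of 2 candidates loaded");
    }
}
```

The trade-off is the one noted above: nothing fails loudly, so a parser can disappear without anyone noticing unless the skipped names are logged.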

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096668#comment-15096668
 ] 

Uwe Schindler commented on TIKA-1824:
-

Hi, as invited on TIKA-1830, here are some comments from Apache Solr:

{quote}
As already stated in the past, we would like to bundle only parsers for text 
document formats, because images, class files, and the like are not really 
useful for indexing by default. Users who want them can still add the missing 
parser bundles, and SPI will do the rest. Currently we disable some parsers by 
removing their JAR files (such as asm-all.jar and netcdf.jar), so Tika's SPI 
loading drops them automatically (because of a ClassNotFoundException). This 
is a bit crude, but it works.

The reason for this was partly also version incompatibilities (the ASM in Tika 
was old, while Lucene needs the newest one), but ASM is not really useful for 
indexing anyway!

In Solr we don't use transitive dependencies in Ivy; we decide for each JAR 
file whether it gets bundled, so we check every release during the update 
anyway.
{quote}

In addition, it would be a good idea to allow loading the Tika SPI files in a 
separate classloader (to isolate the parser classes from others). The reason 
for this is JAR hell. If Tika loaded the parsers in its own classloader 
(optionally, e.g. by configuration), we could place all parsers and their 
dependencies in a separate lib directory outside Solr's lib folder.
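
One way the separate-classloader idea could look, as a rough sketch using only the JDK: a URLClassLoader is built over the JARs in a dedicated directory, and java.util.ServiceLoader is asked to discover providers through that loader. The names loaderFor and loadServices are hypothetical, Runnable is only a placeholder for a parser interface, and nothing here reflects an existing Tika option.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class IsolatedServiceLoading {

    /** Builds a class loader over every *.jar file found directly in libDir. */
    public static URLClassLoader loaderFor(Path libDir, ClassLoader parent) throws IOException {
        List<URL> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toUri().toURL());
            }
        }
        return new URLClassLoader(jars.toArray(new URL[0]), parent);
    }

    /** Discovers providers of the given service type through the supplied loader. */
    public static <T> List<T> loadServices(Class<T> service, ClassLoader loader) {
        List<T> providers = new ArrayList<>();
        for (T provider : ServiceLoader.load(service, loader)) {
            providers.add(provider);
        }
        return providers;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a parser lib directory kept apart from the application's own lib folder.
        Path libDir = Files.createTempDirectory("parser-libs");
        try (URLClassLoader loader = loaderFor(libDir, IsolatedServiceLoading.class.getClassLoader())) {
            // Runnable is only a placeholder service type for this sketch.
            System.out.println(loadServices(Runnable.class, loader).size() + " providers found");
        }
    }
}
```

Note that URLClassLoader delegates to its parent first, so this keeps the parser JARs out of the application's classpath but does not stop a class already visible in the parent from winning; stricter isolation would need a child-first loader.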

> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create initial breakdown of parser modules.


