[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

2019-04-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824837#comment-16824837
 ] 

ASF GitHub Bot commented on TIKA-2293:
--

changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler 
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-486080363
 
 
   Thank you so much for your reply and advice.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Thejan Wijesinghe
>Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API 
> instead of the Runtime.exec approach of executing tesseract out of process.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Tika 1.21?

2019-04-23 Thread Konstantin Gribov
Tim,

I'm +1 since I've pushed TIKA-2555/TIKA-2601. But I'm going to look through the
ossindex-maven-plugin:audit results.

Maybe I'll do some cleanup (like using lambdas instead of anonymous
classes, the diamond operator, etc.) but that's not a blocker ;)
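
For readers unfamiliar with the cleanup mentioned above, here is a minimal sketch of what swapping an anonymous class for a lambda and using the diamond operator can look like. The class and method names (CleanupSketch, sortByLengthOld, sortByLengthNew) are hypothetical and are not taken from the Tika code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Tika.
class CleanupSketch {

    // Before: explicit type argument and an anonymous Comparator class.
    static List<String> sortByLengthOld(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        copy.sort(new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    // After: diamond operator and a lambda, with the same behavior.
    static List<String> sortByLengthNew(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort((a, b) -> Integer.compare(a.length(), b.length()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ccc", "a", "bb");
        System.out.println(sortByLengthOld(words)); // [a, bb, ccc]
        System.out.println(sortByLengthNew(words)); // [a, bb, ccc]
    }
}
```

Both methods return the same ordering, which is why this kind of change is cosmetic and, as noted, not a release blocker.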

-- 
Best regards,
Konstantin Gribov.


On Tue, Apr 23, 2019 at 9:04 AM Oleg Tikhonov  wrote:

> +1 to wait if needed.
>
> On Mon, Apr 22, 2019, 23:23 Tim Allison  wrote:
>
> > All,
> >   I just made a bunch of upgrades to our dependencies.  I still want
> > to take a first pass at TIKA-2749...maybe by the end of this week with
> > release process kicking off the following week?  I could start the
> > regression tests now (well, tomorrowish), though, unless anyone has
> > anything they want to get in...I'm happy to wait, though, till next
> > week to start the regression tests.
> >  WDYT?
> >
> >Cheers,
> >
> >Tim
> >
> > On Mon, Apr 8, 2019 at 2:25 PM Oleg Tikhonov 
> > wrote:
> > >
> > > Great!
> > > +1.
> > > Thanks,
> > > Oleg
> > >
> > > On Mon, Apr 8, 2019, 21:11 Tim Allison  wrote:
> > >
> > > > All,
> > > >   PDFBox will be out in a few days, and POI should be out soon as
> > > > well.  I _think_ I'd like to get in a first draft of "auto" mode for
> > > > > OCR'ing PDFs (TIKA-2749), but other than that, I'd be willing to run a
> > > > > release of 1.21 in the next few weeks.
> > > >   WDYT?
> > > >
> > > > Best,
> > > >
> > > >Tim
> > > >
> >
>


[jira] [Resolved] (TIKA-2566) Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2566.
-
Resolution: Fixed

> Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in 
> the rest of Tika
> --
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0
>
>






[jira] [Updated] (TIKA-2566) Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-2566:

Summary: Move logging in tika-core to slf4j-api (with log4j in test scope) 
as we do in the rest of Tika  (was: Move logging in tika-core to log4j via 
slf4j as we do in the rest of Tika)

> Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in 
> the rest of Tika
> --
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0
>
>






[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika

2019-04-23 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824678#comment-16824678
 ] 

Hudson commented on TIKA-2566:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1649 (See 
[https://builds.apache.org/job/Tika-trunk/1649/])
TIKA-2566 -- migrated tika-core to slf4j (grossws: 
[https://github.com/apache/tika/commit/c797a83db923ceb01c9703f4a8a935c231eb9bd7])
* (edit) tika-core/pom.xml
* (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTikaBinTest.java
* (edit) tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java
* (edit) 
tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java
* (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java
* (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java
* (edit) 
tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java
* (edit) 
tika-nlp/src/main/java/org/apache/tika/parser/recognition/AgeRecogniser.java
* (edit) 
tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java


> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0
>
>






[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika

2019-04-23 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824660#comment-16824660
 ] 

Hudson commented on TIKA-2566:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #404 (See 
[https://builds.apache.org/job/tika-2.x-windows/404/])
TIKA-2566 -- migrated tika-core to slf4j (grossws: rev 
c797a83db923ceb01c9703f4a8a935c231eb9bd7)
* (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java
* (edit) 
tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java
* (edit) tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java
* (edit) 
tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java
* (edit) tika-core/pom.xml
* (edit) 
tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java
* (edit) 
tika-nlp/src/main/java/org/apache/tika/parser/recognition/AgeRecogniser.java
* (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java
* (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTikaBinTest.java


> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0
>
>






[jira] [Resolved] (TIKA-2314) Migrate logging to slf4j in master (2.x) branch

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2314.
-
Resolution: Resolved

> Migrate logging to slf4j in master (2.x) branch
> ---
>
> Key: TIKA-2314
> URL: https://issues.apache.org/jira/browse/TIKA-2314
> Project: Tika
>  Issue Type: Improvement
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
>  Labels: logging
> Fix For: 2.0
>
>






[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

2019-04-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824602#comment-16824602
 ] 

ASF GitHub Bot commented on TIKA-2293:
--

tballison commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java 
version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-485997043
 
 
   See also Luís Filipe Nassif's input: 
https://issues.apache.org/jira/browse/TIKA-2293?focusedCommentId=15943301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15943301
 



>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Thejan Wijesinghe
>Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API 
> instead of the Runtime.exec approach of executing tesseract out of process.  





[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

2019-04-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824600#comment-16824600
 ] 

ASF GitHub Bot commented on TIKA-2293:
--

tballison commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java 
version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-485996640
 
 
   This was my position 2 years ago: 
https://issues.apache.org/jira/browse/TIKA-2293?focusedCommentId=15939076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15939076
 
   
   Let us know if anything has changed.
 



>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Thejan Wijesinghe
>Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API 
> instead of the Runtime.exec approach of executing tesseract out of process.  





[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824568#comment-16824568
 ] 

Tim Allison commented on TIKA-2849:
---

[~boris-petrov] not for your use case, I agree.  

My initial reluctance to respond usefully to this ticket was that I have 
detect+parse as _the_ paradigmatic use case...it took me a while to fully 
comprehend how awful what we were doing would be for your use case...detection 
on a slow network drive.

So, if you are going to parse the file, too, then it is better to use 
TikaInputStream because that will spool the stream to a file (if one doesn't 
already exist) and/or reuse the underlying file for detection and parsing.  
Also, some _parsers_ memory-map the underlying file, so their profile is much 
kinder on RAM if an actual file is available, but for _detection_ alone, we can 
do better.

> TikaInputStream copies the input stream locally
> ---
>
> Key: TIKA-2849
> URL: https://issues.apache.org/jira/browse/TIKA-2849
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Boris Petrov
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?





[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-23 Thread Boris Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824380#comment-16824380
 ] 

Boris Petrov commented on TIKA-2849:


[~talli...@apache.org] Thanks for the great support! Just a last question - in 
that case, I can't see a reason for using {{TikaInputStream}} with a stream. Am 
I right or am I missing something? Why do I have to wrap my stream in an 
{{TikaInputStream}} at all?

> TikaInputStream copies the input stream locally
> ---
>
> Key: TIKA-2849
> URL: https://issues.apache.org/jira/browse/TIKA-2849
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Boris Petrov
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?





[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824296#comment-16824296
 ] 

Tim Allison commented on TIKA-2849:
---

default: 
{noformat}
private int markLimit = 16 * 1024 * 1024;
{noformat}

You can configure it programmatically or via TikaConfig...something along these 
lines:
{noformat}






16777216




16777216




{noformat}
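
To illustrate the general idea behind a mark limit, here is a rough sketch that uses only the JDK. It marks the stream, reads a few leading bytes to guess the type, and then resets, so the caller can still read the stream from the start without it being copied to disk. This is not Tika's detector code: the class name MarkLimitSketch, the method names, and the tiny magic-number table are assumptions made for illustration. Only the default value of 16 * 1024 * 1024 comes from the comment above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch; not Tika's implementation.
class MarkLimitSketch {

    // Mirrors the default quoted above.
    static final int DEFAULT_MARK_LIMIT = 16 * 1024 * 1024;

    // Peeks at the first bytes of the stream and rewinds it afterwards.
    static String sniff(InputStream in, int markLimit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        byte[] head = new byte[Math.min(8, markLimit)];
        int filled = 0;
        try {
            in.mark(markLimit);
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // stream shorter than the header we wanted
                }
                filled += n;
            }
            in.reset(); // the caller sees the stream from its first byte again
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, filled, 0x50, 0x4B, 0x03, 0x04)) {
            return "application/zip"; // "PK\3\4" local file header
        }
        if (startsWith(head, filled, 0x25, 0x50, 0x44, 0x46)) {
            return "application/pdf"; // "%PDF"
        }
        return "application/octet-stream";
    }

    static boolean startsWith(byte[] data, int len, int... magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeZip = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeZip));
        System.out.println(sniff(in, DEFAULT_MARK_LIMIT)); // application/zip
        System.out.println(in.read()); // 80, i.e. 'P': nothing was consumed
    }
}
```

The mark limit bounds how far a reader may go before reset() stops being guaranteed, which is why a larger limit trades memory for the ability to look deeper into the stream.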

> TikaInputStream copies the input stream locally
> ---
>
> Key: TIKA-2849
> URL: https://issues.apache.org/jira/browse/TIKA-2849
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Boris Petrov
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?





Re: Wiki migration

2019-04-23 Thread Tim Allison
I'm in.  Thank you, Konstantin!

On Mon, Apr 22, 2019 at 1:22 PM Tim Allison  wrote:
>
> No luck still.  I'm able to login w my Apache credentials, but I don't
> appear to have permissions to see anything.  Should I open a ticket
> with infra or comment on INFRA-18108?
>
> On Sat, Apr 20, 2019 at 5:10 PM Konstantin Gribov  wrote:
> >
> > Ok, Tim, then try again with logout and login to check when you have a
> > moment to do it. I removed the explicit permissions now 'cause Gavin said that
> > all Tika committers and PMC are in the tika group in cwiki.
> >
> > --
> > Best regards,
> > Konstantin Gribov.
> >
> >
> > On Fri, Apr 19, 2019 at 11:22 PM Tim Allison  wrote:
> >
> > > Doh, I didn't.  I think I was TimothyAllison...
> > >
> > > On Thu, Apr 18, 2019 at 9:50 AM Konstantin Gribov 
> > > wrote:
> > > >
> > > > > Tim, did you have an account in cwiki before the migration with the same
> > > > > name as your Apache ID? If you did, ask infra to merge them.
> > > > > Meanwhile I will add full rights to your account explicitly.
> > > >
> > > > --
> > > > Best regards,
> > > > Konstantin Gribov.
> > > >
> > > >
> > > > On Wed, Apr 17, 2019 at 10:49 PM Tim Allison 
> > > wrote:
> > > >
> > > > > Thank you, Konstantin...would someone be able to grant me karma?
> > > > >
> > > > > The following error(s) occurred:
> > > > >
> > > > > You do not have permission to access
> > > > > /spaces/spacepermissions.action?key=TIKA. To access this page, please
> > > > > log in as a user with sufficient permissions.
> > > > >
> > > > > You are currently logged in as tallison.
> > > > >
> > > > > > Click here for the dashboard. To log in as another user, you must
> > > > > > first log out.
> > > > >
> > > > > > On Wed, Apr 17, 2019 at 2:18 PM Ken Krugler <kkrugler_li...@transpac.com> wrote:
> > > > > >
> > > > > > Hi Konstantin,
> > > > > >
> > > > > > > I was able to log in with my Apache LDAP credentials, and modify pages.
> > > > > >
> > > > > > Thanks for getting this started!
> > > > > >
> > > > > > — Ken
> > > > > >
> > > > > > > > On Apr 17, 2019, at 7:05 AM, Konstantin Gribov wrote:
> > > > > > > >
> > > > > > > > Hi, folks.
> > > > > > > >
> > > > > > > > It seems the basic migration has completed, but quite a bit of page
> > > > > > > > reorganization is required.
> > > > > > > >
> > > > > > > > Also, I need someone from the PMC or committers (but not me or Chris)
> > > > > > > > to log in with an ASF LDAP account to test that full permissions to
> > > > > > > > the TIKA space are granted (e.g. check if [1] works for you).
> > > > > > > > The Infra team added the tika group from LDAP to Confluence (see
> > > > > > > > INFRA-18108).
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > > https://cwiki.apache.org/confluence/spaces/spacepermissions.action?key=TIKA
> > > > > > > >
> > > > > > > > P. S. Chris, is chrismattmann your legitimate account there? Will you
> > > > > > > > merge it with your LDAP account via an INFRA ticket later?
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Konstantin Gribov.
> > > > > > >
> > > > > > >
> > > > > > > > On Thu, Mar 21, 2019 at 11:56 PM Tim Allison wrote:
> > > > > > > >
> > > > > > > >> +1 let me know what I need to do.
> > > > > > > >>
> > > > > > > >> On Thu, Mar 21, 2019 at 1:02 PM Konstantin Gribov <gros...@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>>
> > > > > > > >>> Hi, folks
> > > > > > > >>>
> > > > > > > >>> What do you think about starting wiki migration (from moin to
> > > > > > > >>> confluence)?
> > > > > > > >>>
> > > > > > > >>> I can try it via selfservice.a.o if you consent but I'm not sure if I
> > > > > > > >>> have enough access to do so. Maybe only Tim as PMC Chair can.
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Best regards,
> > > > > > >>> Konstantin Gribov.
> > > > > > >>
> > > > > >
> > > > > > --
> > > > > > Ken Krugler
> > > > > > +1 530-210-6378
> > > > > > http://www.scaleunlimited.com
> > > > > > Custom big data solutions & training
> > > > > > Flink, Solr, Hadoop, Cascading & Cassandra
> > > > > >
> > > > >
> > >


[jira] [Closed] (TIKA-2315) Update logging page at wiki with actual info

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-2315.
---

> Update logging page at wiki with actual info
> 
>
> Key: TIKA-2315
> URL: https://issues.apache.org/jira/browse/TIKA-2315
> Project: Tika
>  Issue Type: Task
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Minor
>  Labels: logging
>






[jira] [Resolved] (TIKA-2315) Update logging page at wiki with actual info

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2315.
-
Resolution: Fixed

> Update logging page at wiki with actual info
> 
>
> Key: TIKA-2315
> URL: https://issues.apache.org/jira/browse/TIKA-2315
> Project: Tika
>  Issue Type: Task
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Minor
>  Labels: logging
>






[jira] [Updated] (TIKA-2314) Migrate logging to slf4j in master (2.x) branch

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-2314:

Summary: Migrate logging to slf4j in master (2.x) branch  (was: Migrate 
logging to slf4j in 2.x branch)

> Migrate logging to slf4j in master (2.x) branch
> ---
>
> Key: TIKA-2314
> URL: https://issues.apache.org/jira/browse/TIKA-2314
> Project: Tika
>  Issue Type: Improvement
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
>  Labels: logging
> Fix For: 2.0
>
>






[jira] [Updated] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-2566:

Fix Version/s: (was: 1.20)

> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0
>
>






[jira] [Resolved] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2566.
-
   Resolution: Fixed
Fix Version/s: 1.20

> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0, 1.20
>
>






[jira] [Reopened] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov reopened TIKA-2566:
-

> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0.0, 1.20
>
>






[jira] [Commented] (TIKA-2854) upgrade out-of-date dependencies with outstanding CVEs

2019-04-23 Thread Andrew Pavlin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824212#comment-16824212
 ] 

Andrew Pavlin commented on TIKA-2854:
-

Regarding your question on the ucar versions, I got those version numbers from 
the ucar.edu website. Not sure why (at the time of my bug submittal) they 
listed different version numbers for the different software modules, but even 
their examples for POM files show using common version numbers for netcdf, cdm, 
grib. I may also have been reading an incorrect page, as they now say the 
current version is consistently 4.6.13 as of today.

> upgrade out-of-date dependencies with outstanding CVEs
> --
>
> Key: TIKA-2854
> URL: https://issues.apache.org/jira/browse/TIKA-2854
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier, parser
>Affects Versions: 1.20
>Reporter: Andrew Pavlin
>Priority: Major
>
> Besides the libraries reported in TIKA-2801 and TIKA-2835, the following 4th 
> party dependencies are out-of-date and should be upgraded to the latest 
> versions. The first three have outstanding CVEs which would be resolved by 
> using the newer versions of those dependencies.
> jackson-databind (is 2.9.7, should be 2.9.8)
> guava (is 17.0, should be 27.0)
> sqlite-jdbc (is 3.25.2, should be 3.27.2.1)
> No current CVEs but still out-of-date:
> Apache commons-codec (is 1.11, should be 1.12)
> Apache CXF (is 3.2.7, should be 3.3.1)
> Apache httpcomponents (is 4.5.6, should be 4.5.8)
> Apache james mime4j (is 0.8.2, should be 0.8.3)
> Apache opennlp-tools (is 1.9.0, should be 1.9.1)
> parso (is 2.0.10, should be  2.0.11)
> jackson-annotations
> jackson-core
> jackcess (is 2.1.12, should be 3.0.0)
> jackcess-encrypt (is 2.1.4, should be 3.0.0)
> org.osgi.compendium (is 4.0.0, should be 5.0.0)
> org.osgi.core (is 4.0.0, should be 6.0.0)
> junrar (is 2.0.0, should be 4.0.0)
> java-libpst (is 0.8.1, should be 0.9.3)
> jna (is 5.1.0, should be 5.2.0)
> Bouncy Castle bcprov and bcmail (is 1.60, should be 1.61)
> slf4j-log4j12 (is 1.7.25, should be 1.7.26)
> UCAR cdm (is 4.5.5, should be 5.0.0)
> UCAR grib (is 4.5.5, should be 8.0.0)
> UCAR httpservices (is 4.5.5, should be 4.6.7)
> UCAR netcdf4 (incorrectly labeled as 4.5.5, should be 4.3.22)
> bndlib (is 1.50.0, should be 4.2.0)





[jira] [Resolved] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2555.
-
   Resolution: Fixed
Fix Version/s: 1.21
   2.0

> Text with [underline] + [another format] in word document generates 
> overlapping html tags.
> --
>
> Key: TIKA-2555
> URL: https://issues.apache.org/jira/browse/TIKA-2555
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
>Reporter: Serban Alexe
>Assignee: Konstantin Gribov
>Priority: Minor
> Fix For: 2.0, 1.21
>
> Attachments: Clipboard02.jpg
>
>
> I have a sample _.docx_ document which contains a single line of text.
> Making that text:
>  * +underlined+
>  ** AND at least one of the following two
>  * _italic_
>  * *bold*
> will cause the generated _.xhtml_ file to contain overlapping tags.
>  
> _+Example+_:
> *+The quick brown fox jumps over the lazy dog.+*
> will result in
> The quick brown fox jumps over the lazy dog. 
> which causes some browsers (Firefox, Chrome) to give an error and not display 
> the content of the file...
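
One general way to avoid overlapping tags of the kind reported above is to keep the open tags on a stack and close them in reverse order whenever the formatting changes. The sketch below assumes that approach. It is not the fix that went into Tika, and the names NestedTagsSketch, Run and render are hypothetical. Text escaping is left out to keep it short.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; not the actual Tika fix.
class NestedTagsSketch {

    // A run of text plus the formatting tags that apply to it.
    static final class Run {
        final String text;
        final List<String> tags;

        Run(String text, String... tags) {
            this.text = text;
            this.tags = Arrays.asList(tags);
        }
    }

    // Emits markup in which every tag is closed in reverse order of opening.
    static String render(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        for (Run run : runs) {
            // Close from the top of the stack until only wanted tags remain.
            while (!run.tags.containsAll(open)) {
                out.append("</").append(open.pop()).append('>');
            }
            // Open whatever this run needs that is not open yet.
            for (String tag : run.tags) {
                if (!open.contains(tag)) {
                    open.push(tag);
                    out.append('<').append(tag).append('>');
                }
            }
            out.append(run.text); // no escaping in this sketch
        }
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("bold and underlined ", "b", "u"),
                new Run("underlined only", "u"));
        // <b><u>bold and underlined </u></b><u>underlined only</u>
        System.out.println(render(runs));
    }
}
```

Closing a tag that is still wanted and reopening it costs a few extra characters, but it guarantees the output is well nested.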
>  





[jira] [Resolved] (TIKA-2601) Invalid XHTML output (overlapping a and formatting tags) for some WORD documents

2019-04-23 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2601.
-
   Resolution: Fixed
Fix Version/s: 1.21
   2.0

> Invalid XHTML output (overlapping a and formatting tags) for some WORD 
> documents
> 
>
> Key: TIKA-2601
> URL: https://issues.apache.org/jira/browse/TIKA-2601
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
> Environment: Linked is a sample document with its corresponding 
> output.
>Reporter: Filip
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0, 1.21
>
> Attachments: Invalid-XML.doc, Test.doc, test.html
>
>
> In some WORD (.doc, .docx) documents the XHTML elements are not closed 
> properly. This usually happens when there are link elements (<a>) as well as 
> italic or bold elements (<i>, <b>).
>  
> Fix should be done in 
> [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java]





[jira] [Resolved] (TIKA-2841) Improve robustness of parsers of zip-based files on truncated files

2019-04-23 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2841.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 1.21
   2.0.0

> Improve robustness of parsers of zip-based files on truncated files
> ---
>
> Key: TIKA-2841
> URL: https://issues.apache.org/jira/browse/TIKA-2841
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.0.0, 1.21
>
> Attachments: truncated_1.zip, truncated_3.zip
>
>
> We've done some work on this with docx, etc, but we can do more with epub and 
> open office, and, frankly msoffice as well.  We should also improve the 
> ContainerDetector to work more robustly with truncated zips.
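
As a rough illustration of why streaming helps with truncated zips, the sketch below reads entries sequentially with the JDK's ZipInputStream and keeps whatever was read completely before the data ran out. It is not how Tika's parsers or the ContainerDetector are implemented. The class TruncatedZipSketch, its methods, and the generated test archive are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch using java.util.zip only; not Tika code.
class TruncatedZipSketch {

    // Returns the names of entries whose data could be read to the end.
    static List<String> readableEntries(InputStream in) {
        List<String> complete = new ArrayList<>();
        byte[] buf = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                while (zip.read(buf) != -1) {
                    // drain the entry; a real parser would process the bytes
                }
                complete.add(entry.getName());
            }
        } catch (IOException e) {
            // Truncated or corrupt input: keep what was read so far.
        }
        return complete;
    }

    // Builds a small in-memory zip: a tiny entry followed by a large one.
    static byte[] buildZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            byte[] noise = new byte[50_000];
            new Random(42).nextBytes(noise); // incompressible filler
            zip.putNextEntry(new ZipEntry("b.bin"));
            zip.write(noise);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] full = buildZip();
        byte[] truncated = Arrays.copyOf(full, full.length / 2);
        System.out.println(readableEntries(new ByteArrayInputStream(full)));      // [a.txt, b.bin]
        System.out.println(readableEntries(new ByteArrayInputStream(truncated))); // [a.txt]
    }
}
```

Because the central directory sits at the end of a zip file, readers that depend on it tend to fail outright on truncated input, while a sequential reader can still recover the leading entries.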





[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

2019-04-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823931#comment-16823931
 ] 

ASF GitHub Bot commented on TIKA-2293:
--

changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler 
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-485743349
 
 
   Hello! I wonder why this implementation (Tess4jOCRParser.java) is not in 
the current Apache Tika version. Thank you.
 



>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Thejan Wijesinghe
>Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API 
> instead of the Runtime.exec approach of executing tesseract out of process.  


