[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824837#comment-16824837 ] ASF GitHub Bot commented on TIKA-2293: -- changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser URL: https://github.com/apache/tika/pull/158#issuecomment-486080363 Thank you so much for your reply and advice. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Tess4jOCRParser - A simpler Java version of TesseractOCRParser > --- > > Key: TIKA-2293 > URL: https://issues.apache.org/jira/browse/TIKA-2293 > Project: Tika > Issue Type: Improvement > Components: ocr >Reporter: Thejan Wijesinghe >Priority: Major > > Right now, TesseractOCRParser calls tesseract and imagemagick from command > line. Intention of this new parser "Tess4jOCRParser" is to use the Tess4J API > instead of the runtime.exec way to executing tesseract out of process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Tika 1.21?
Tim,

I'm +1 since I've pushed TIKA-2555/TIKA-2601. But I'm going to look through the ossindex-maven-plugin:audit results. Maybe I'll do some cleanup (like using lambdas instead of anonymous classes, the diamond operator, etc.) but that's not a blocker ;)

--
Best regards,
Konstantin Gribov.

On Tue, Apr 23, 2019 at 9:04 AM Oleg Tikhonov wrote:
> +1 to wait if needed.
>
> On Mon, Apr 22, 2019, 23:23 Tim Allison wrote:
> > All,
> > I just made a bunch of upgrades to our dependencies. I still want to take a first pass at TIKA-2749...maybe by the end of this week with release process kicking off the following week? I could start the regression tests now (well, tomorrowish), though, unless anyone has anything they want to get in...I'm happy to wait, though, till next week to start the regression tests.
> > WDYT?
> >
> > Cheers,
> >
> > Tim
> >
> > On Mon, Apr 8, 2019 at 2:25 PM Oleg Tikhonov wrote:
> > >
> > > Great!
> > > +1.
> > > Thanks,
> > > Oleg
> > >
> > > On Mon, Apr 8, 2019, 21:11 Tim Allison wrote:
> > > >
> > > > All,
> > > > PDFBox will be out in a few days, and POI should be out soon as well. I _think_ I'd like to get in a first draft of "auto" mode for OCR'ing PDFs (TIKA-2749), but other than that, I'd be willing to run a release of 1.21 in the next few weeks.
> > > > WDYT?
> > > >
> > > > Best,
> > > >
> > > > Tim
[jira] [Resolved] (TIKA-2566) Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2566. - Resolution: Fixed > Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in > the rest of Tika > -- > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2566) Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2566: Summary: Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika (was: Move logging in tika-core to log4j via slf4j as we do in the rest of Tika) > Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in > the rest of Tika > -- > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824678#comment-16824678 ] Hudson commented on TIKA-2566: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1649 (See [https://builds.apache.org/job/Tika-trunk/1649/]) TIKA-2566 -- migrated tika-core to slf4j (grossws: [https://github.com/apache/tika/commit/c797a83db923ceb01c9703f4a8a935c231eb9bd7]) * (edit) tika-core/pom.xml * (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTikaBinTest.java * (edit) tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java * (edit) tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java * (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java * (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java * (edit) tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java * (edit) tika-nlp/src/main/java/org/apache/tika/parser/recognition/AgeRecogniser.java * (edit) tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824660#comment-16824660 ] Hudson commented on TIKA-2566: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #404 (See [https://builds.apache.org/job/tika-2.x-windows/404/]) TIKA-2566 -- migrated tika-core to slf4j (grossws: rev c797a83db923ceb01c9703f4a8a935c231eb9bd7) * (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java * (edit) tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java * (edit) tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java * (edit) tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java * (edit) tika-core/pom.xml * (edit) tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java * (edit) tika-nlp/src/main/java/org/apache/tika/parser/recognition/AgeRecogniser.java * (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java * (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTikaBinTest.java > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2314) Migrate logging to slf4j in master (2.x) branch
[ https://issues.apache.org/jira/browse/TIKA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2314. - Resolution: Resolved > Migrate logging to slf4j in master (2.x) branch > --- > > Key: TIKA-2314 > URL: https://issues.apache.org/jira/browse/TIKA-2314 > Project: Tika > Issue Type: Improvement >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Labels: logging > Fix For: 2.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824602#comment-16824602 ] ASF GitHub Bot commented on TIKA-2293: -- tballison commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser URL: https://github.com/apache/tika/pull/158#issuecomment-485997043 See also Luís Filipe Nassif's input: https://issues.apache.org/jira/browse/TIKA-2293?focusedCommentId=15943301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15943301
[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824600#comment-16824600 ] ASF GitHub Bot commented on TIKA-2293: -- tballison commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser URL: https://github.com/apache/tika/pull/158#issuecomment-485996640 This was my position 2 years ago: https://issues.apache.org/jira/browse/TIKA-2293?focusedCommentId=15939076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15939076 Let us know if anything has changed.
[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824568#comment-16824568 ]

Tim Allison commented on TIKA-2849:
---

[~boris-petrov] not for your use case, I agree. My initial reluctance to respond usefully to this ticket was that I have detect+parse as _the_ paradigmatic use case...it took me a while to fully comprehend how awful what we were doing would be for your use case...detection on a slow network drive.

So, if you are going to parse the file, too, then it is better to use TikaInputStream because that will spool the file (if it doesn't exist) and/or reuse the underlying file for detection and parsing. Also, some _parsers_ memory-map the underlying file, so their profile is much kinder on RAM if an actual file is available, but for _detection_ alone, we can do better.

> TikaInputStream copies the input stream locally
> ---
>
> Key: TIKA-2849
> URL: https://issues.apache.org/jira/browse/TIKA-2849
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.20
> Reporter: Boris Petrov
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.21
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, path, REPLACE_EXISTING);" which is very, very bad. This input stream could be, as in our case, an input stream from a network file which is tens or hundreds of gigabytes large. Copying it locally is a huge waste of resources to say the least. Why does it do that and can I make it not do it? Or is this something that has to be fixed in Tika?
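The trade-off described in the TIKA-2849 comment above (spool the stream to a local file once, then reuse that file for both detection and parsing) can be sketched with the JDK alone. This is only an illustration of the idea, loosely modeled on the `Files.copy(in, path, REPLACE_EXISTING)` call quoted in the ticket; the class name `SpoolOnce` and its methods are hypothetical and are not Tika's API.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika: copies a stream to a temp file at most once.
final class SpoolOnce implements Closeable {
    private final InputStream source;
    private Path spooled; // null until the first getPath() call

    SpoolOnce(InputStream source) {
        this.source = source;
    }

    /** Copies the whole stream to a temp file on first use; later calls reuse that file. */
    Path getPath() throws IOException {
        if (spooled == null) {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            // This full copy is the expensive step for a very large remote stream.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            spooled = tmp;
        }
        return spooled;
    }

    @Override
    public void close() throws IOException {
        source.close();
        if (spooled != null) {
            Files.deleteIfExists(spooled);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 example body".getBytes(StandardCharsets.US_ASCII);
        try (SpoolOnce spool = new SpoolOnce(new ByteArrayInputStream(data))) {
            // "Detection": inspect the leading bytes of the spooled file.
            byte[] all = Files.readAllBytes(spool.getPath());
            String magic = new String(all, 0, 4, StandardCharsets.US_ASCII);
            // "Parsing": a second pass over the same file; nothing is copied again.
            long size = Files.size(spool.getPath());
            System.out.println(magic + " / " + size + " bytes");
        }
    }
}
```

With detect+parse the single copy is amortized over two passes; with detection alone the copy is pure overhead, which is the case the ticket complains about.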
[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824380#comment-16824380 ] Boris Petrov commented on TIKA-2849: [~talli...@apache.org] Thanks for the great support! Just a last question - in that case, I can't see a reason for using {{TikaInputStream}} with a stream. Am I right or am I missing something? Why do I have to wrap my stream in a {{TikaInputStream}} at all?
[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824296#comment-16824296 ]

Tim Allison commented on TIKA-2849:
---

default:
{noformat}
private int markLimit = 16 * 1024 * 1024;
{noformat}
You can configure it programmatically or via TikaConfig...something along these lines:
{noformat}
16777216 16777216
{noformat}
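For readers unfamiliar with mark limits: the default quoted above works out to 16 * 1024 * 1024 = 16777216 bytes (16 MiB). A rough sketch of how a bounded mark/reset lets detection look at the head of a stream without copying all of it is below. It uses only `java.io`; the class `BoundedPeek` and its `peek` method are made up for illustration and say nothing about which Tika class actually owns `markLimit`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: reads a bounded prefix and rewinds.
final class BoundedPeek {
    // Same arithmetic as the quoted default: 16 MiB.
    static final int MARK_LIMIT = 16 * 1024 * 1024;

    /** Reads at most 'limit' bytes, then resets the stream so they can be read again. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        in.reset(); // safe: no more than 'limit' bytes were consumed
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "PK\u0003\u0004...rest of a zip file".getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, 4); // a small limit keeps the demo cheap
        System.out.println("looks like zip: " + (head[0] == 'P' && head[1] == 'K'));
        System.out.println("stream rewound: " + (in.read() == 'P'));
    }
}
```

The point of the limit is that memory use for detection is capped regardless of how large the underlying stream is.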
Re: Wiki migration
I'm in. Thank you, Konstantin!

On Mon, Apr 22, 2019 at 1:22 PM Tim Allison wrote:
>
> No luck still. I'm able to log in with my Apache credentials, but I don't appear to have permissions to see anything. Should I open a ticket with infra or comment on INFRA-18108?
>
> On Sat, Apr 20, 2019 at 5:10 PM Konstantin Gribov wrote:
> >
> > Ok, Tim, then try again with logout and login to check when you have a moment to do it. I removed explicit permissions now 'cause Gavin said that all Tika committers and PMC are in the tika group in cwiki.
> >
> > --
> > Best regards,
> > Konstantin Gribov.
> >
> > On Fri, Apr 19, 2019 at 11:22 PM Tim Allison wrote:
> > >
> > > Doh, I didn't. I think I was TimothyAllison...
> > >
> > > On Thu, Apr 18, 2019 at 9:50 AM Konstantin Gribov wrote:
> > > >
> > > > Tim, did you have an account in cwiki before migration with the same name as your apache id? If you did, ask infra to merge them.
> > > > Meanwhile I will add full rights to your account explicitly.
> > > >
> > > > --
> > > > Best regards,
> > > > Konstantin Gribov.
> > > >
> > > > On Wed, Apr 17, 2019 at 10:49 PM Tim Allison wrote:
> > > > >
> > > > > Thank you, Konstantin...would someone be able to grant me karma?
> > > > >
> > > > > The following error(s) occurred:
> > > > >
> > > > > You do not have permission to access /spaces/spacepermissions.action?key=TIKA. To access this page, please log in as a user with sufficient permissions.
> > > > >
> > > > > You are currently logged in as tallison.
> > > > >
> > > > > Click here for the dashboard. To log in as another user, you must first log out.
> > > > >
> > > > > On Wed, Apr 17, 2019 at 2:18 PM Ken Krugler <kkrugler_li...@transpac.com> wrote:
> > > > > >
> > > > > > Hi Konstantin,
> > > > > >
> > > > > > I was able to log in with my Apache LDAP credentials, and modify pages.
> > > > > >
> > > > > > Thanks for getting this started!
> > > > > >
> > > > > > -- Ken
> > > > > >
> > > > > > > On Apr 17, 2019, at 7:05 AM, Konstantin Gribov wrote:
> > > > > > >
> > > > > > > Hi, folks.
> > > > > > >
> > > > > > > It seems basic migration completed but quite a bit of page reorganization is required.
> > > > > > >
> > > > > > > Also, I need someone from PMC or committers (but not me or Chris) to log in with an ASF LDAP account to test that full permissions to the TIKA space are granted (e.g. check if [1] works for you).
> > > > > > > Infra team added tika group from ldap to confluence (see INFRA-18108).
> > > > > > >
> > > > > > > [1]: https://cwiki.apache.org/confluence/spaces/spacepermissions.action?key=TIKA
> > > > > > >
> > > > > > > P. S. Chris, is chrismattmann your legitimate account there? Will you merge it with your LDAP account via INFRA ticket later?
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Konstantin Gribov.
> > > > > > >
> > > > > > > On Thu, Mar 21, 2019 at 11:56 PM Tim Allison wrote:
> > > > > > >
> > > > > > >> +1 let me know what I need to do.
> > > > > > >>
> > > > > > >> On Thu, Mar 21, 2019 at 1:02 PM Konstantin Gribov <gros...@gmail.com> wrote:
> > > > > > >>>
> > > > > > >>> Hi, folks
> > > > > > >>>
> > > > > > >>> What do you think about starting wiki migration (from moin to confluence)?
> > > > > > >>>
> > > > > > >>> I can try it via selfservice.a.o if you consent but I'm not sure if I have enough access to do so. Maybe only Tim as PMC Chair can.
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Best regards,
> > > > > > >>> Konstantin Gribov.
> > > > > >
> > > > > > --
> > > > > > Ken Krugler
> > > > > > +1 530-210-6378
> > > > > > http://www.scaleunlimited.com
> > > > > > Custom big data solutions & training
> > > > > > Flink, Solr, Hadoop, Cascading & Cassandra
[jira] [Closed] (TIKA-2315) Update logging page at wiki with actual info
[ https://issues.apache.org/jira/browse/TIKA-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2315. --- > Update logging page at wiki with actual info > > > Key: TIKA-2315 > URL: https://issues.apache.org/jira/browse/TIKA-2315 > Project: Tika > Issue Type: Task >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Minor > Labels: logging > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2315) Update logging page at wiki with actual info
[ https://issues.apache.org/jira/browse/TIKA-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2315. - Resolution: Fixed > Update logging page at wiki with actual info > > > Key: TIKA-2315 > URL: https://issues.apache.org/jira/browse/TIKA-2315 > Project: Tika > Issue Type: Task >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Minor > Labels: logging > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2314) Migrate logging to slf4j in master (2.x) branch
[ https://issues.apache.org/jira/browse/TIKA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2314: Summary: Migrate logging to slf4j in master (2.x) branch (was: Migrate logging to slf4j in 2.x branch) > Migrate logging to slf4j in master (2.x) branch > --- > > Key: TIKA-2314 > URL: https://issues.apache.org/jira/browse/TIKA-2314 > Project: Tika > Issue Type: Improvement >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Labels: logging > Fix For: 2.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2566: Fix Version/s: (was: 1.20) > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2566. - Resolution: Fixed Fix Version/s: 1.20 > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0, 1.20 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov reopened TIKA-2566: - > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0, 1.20 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2854) upgrade out-of-date dependencies with outstanding CVEs
[ https://issues.apache.org/jira/browse/TIKA-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824212#comment-16824212 ] Andrew Pavlin commented on TIKA-2854: - Regarding your question on the ucar versions, I got those version numbers from the ucar.edu website. Not sure why (at the time of my bug submittal) they listed different version numbers for the different software modules, but even their examples for POM files show using common version numbers for netcdf, cdm, grib. I may also have been reading an incorrect page, as they now say the current version is consistently 4.6.13 as of today. > upgrade out-of-date dependencies with outstanding CVEs > -- > > Key: TIKA-2854 > URL: https://issues.apache.org/jira/browse/TIKA-2854 > Project: Tika > Issue Type: Bug > Components: languageidentifier, parser >Affects Versions: 1.20 >Reporter: Andrew Pavlin >Priority: Major > > Besides the libraries reported in TIKA-2801 and TIKA-2835, the following 4th > party dependencies are out-of-date and should be upgraded to the latest > versions. The first three have outstanding CVEs which would be resolved by > using the newer versions of those dependencies. 
> jackson-databind (is 2.9.7, should be 2.9.8)
> guava (is 17.0, should be 27.0)
> sqlite-jdbc (is 3.25.2, should be 3.27.2.1)
> No current CVEs but still out-of-date:
> Apache commons-codec (is 1.11, should be 1.12)
> Apache CXF (is 3.2.7, should be 3.3.1)
> Apache httpcomponents (is 4.5.6, should be 4.5.8)
> Apache james mime4j (is 0.8.2, should be 0.8.3)
> Apache opennlp-tools (is 1.9.0, should be 1.9.1)
> parso (is 2.0.10, should be 2.0.11)
> jackson-annotations
> jackson-core
> jackcess (is 2.1.12, should be 3.0.0)
> jackcess-encrypt (is 2.1.4, should be 3.0.0)
> org.osgi.compendium (is 4.0.0, should be 5.0.0)
> org.osgi.core (is 4.0.0, should be 6.0.0)
> junrar (is 2.0.0, should be 4.0.0)
> java-libpst (is 0.8.1, should be 0.9.3)
> jna (is 5.1.0, should be 5.2.0)
> Bouncy Castle bcprov and bcmail (is 1.60, should be 1.61)
> slf4j-log4j12 (is 1.7.25, should be 1.7.26)
> UCAR cdm (is 4.5.5, should be 5.0.0)
> UCAR grib (is 4.5.5, should be 8.0.0)
> UCAR httpservices (is 4.5.5, should be 4.6.7)
> UCAR netcdf4 (incorrectly labeled as 4.5.5, should be 4.3.22)
> bndlib (is 1.50.0, should be 4.2.0)

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.
[ https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-2555.
-
Resolution: Fixed
Fix Version/s: 1.21
               2.0

> Text with [underline] + [another format] in word document generates overlapping html tags.
> --
>
> Key: TIKA-2555
> URL: https://issues.apache.org/jira/browse/TIKA-2555
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Reporter: Serban Alexe
> Assignee: Konstantin Gribov
> Priority: Minor
> Fix For: 2.0, 1.21
>
> Attachments: Clipboard02.jpg
>
>
> I have a sample _.docx_ document which contains one single line of text.
> Making that text:
> * +underlined+
> ** AND at least one of the following two
> * _italic_
> * *bold*
> will cause the generated _.xhtml_ file to contain overlapping tags.
>
> _+Example+_:
> *+The quick brown fox jumps over the lazy dog.+*
> will result in
> The quick brown fox jumps over the lazy dog.
> which causes some browsers (Firefox, Chrome) to give an error and not display the content of the file...
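To make the "overlapping tags" problem in TIKA-2555 concrete: XHTML requires elements to close in the reverse of the order they were opened, so output shaped like `<b><u>text</b></u>` is not well-formed. One general way to avoid that is to track open elements on a stack and, when an outer format ends first, close the inner ones and reopen them afterwards. The sketch below shows that technique only; `NestedTagWriter` is a hypothetical class, it does no escaping, and it is not the fix that was committed to Tika.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration, not Tika code: keeps emitted tags properly nested.
final class NestedTagWriter {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>(); // innermost element first

    void open(String tag) {
        out.append('<').append(tag).append('>');
        open.push(tag);
    }

    void text(String s) {
        out.append(s); // no escaping; sketch only
    }

    /** Closes 'tag'; any elements opened after it are closed first and then reopened. */
    void close(String tag) {
        if (!open.contains(tag)) {
            return; // nothing to close
        }
        Deque<String> reopen = new ArrayDeque<>();
        while (true) {
            String top = open.pop();
            out.append("</").append(top).append('>');
            if (top.equals(tag)) {
                break;
            }
            reopen.push(top);
        }
        while (!reopen.isEmpty()) {
            open(reopen.pop()); // restores the original nesting order
        }
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        NestedTagWriter w = new NestedTagWriter();
        w.open("b");
        w.open("u");
        w.text("x");
        w.close("b"); // bold ends while underline is still active
        w.text("y");
        w.close("u");
        System.out.println(w.result()); // prints <b><u>x</u></b><u>y</u>
    }
}
```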
[jira] [Resolved] (TIKA-2601) Invalid XHTML output (overlapping a and formatting tags) for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2601. - Resolution: Fixed Fix Version/s: 1.21 2.0 > Invalid XHTML output (overlapping a and formatting tags) for some WORD > documents > > > Key: TIKA-2601 > URL: https://issues.apache.org/jira/browse/TIKA-2601 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 > Environment: Linked is a sample document with its corresponding > output. >Reporter: Filip >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0, 1.21 > > Attachments: Invalid-XML.doc, Test.doc, test.html > > > In some WORD (.doc, .docx) documents the XHTML elements are not closed > properly. This usually happens when there are link elements () as well as > italic or bold elements (). > > Fix should be done in > [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2841) Improve robustness of parsers of zip-based files on truncated files
[ https://issues.apache.org/jira/browse/TIKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2841. --- Resolution: Fixed Assignee: Tim Allison Fix Version/s: 1.21 2.0.0 > Improve robustness of parsers of zip-based files on truncated files > --- > > Key: TIKA-2841 > URL: https://issues.apache.org/jira/browse/TIKA-2841 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.0.0, 1.21 > > Attachments: truncated_1.zip, truncated_3.zip > > > We've done some work on this with docx, etc, but we can do more with epub and > open office, and, frankly msoffice as well. We should also improve the > ContainerDetector to work more robustly with truncated zips. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823931#comment-16823931 ] ASF GitHub Bot commented on TIKA-2293: -- changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser URL: https://github.com/apache/tika/pull/158#issuecomment-485743349 Hello! I wonder why tess4jocrparser.java doesn't have this implementation in the current Apache Tika version. Thank you.