[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864742#comment-17864742
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2221002267

   Will do. Thank you for the help.
   
   With the above `commons-io` suggestion, everything looks good now. Will be 
doing more testing.




> Upgrade to PDFBox 3.x when available
> 
>
> Key: TIKA-3347
> URL: https://issues.apache.org/jira/browse/TIKA-3347
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> 3.0.0-RC1 was recently released.  We should integrate it on a dev branch asap 
> so that we can help with regression testing...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864561#comment-17864561
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2219959205

   This is really getting off-topic. Please post to the Tika users mailing list 
(don't forget to subscribe):
   https://lists.apache.org/list.html?u...@tika.apache.org (see bottom left),
   or ask on Stack Overflow.






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864450#comment-17864450
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2219468516

   It worked for me with small changes because your code isn't runnable:
   
   ```
   Path input = Paths.get("samplepptx.pptx");
   Writer writer = new OutputStreamWriter(System.out);
   ```
   The file is
   https://scholar.harvard.edu/files/torman_personal/files/samplepptx.pptx
   
   Maybe you have another commons-io dependency in your pom.xml. Use this:
   
   ```
   <dependency>
       <groupId>commons-io</groupId>
       <artifactId>commons-io</artifactId>
       <version>2.16.1</version>
   </dependency>
   ```
   
   Use the maven versions plugin to check that you're up-to-date:
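   ```
   <plugin>
       <groupId>org.codehaus.mojo</groupId>
       <artifactId>versions-maven-plugin</artifactId>
       <executions>
           <execution>
               <phase>install</phase>
               <goals>
                   <goal>display-plugin-updates</goal>
                   <goal>display-dependency-updates</goal>
                   <goal>display-property-updates</goal>
               </goals>
           </execution>
       </executions>
       <configuration>
           <!-- two options, set to true and false; their element names were lost in the mail archive -->
       </configuration>
   </plugin>
   ```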
   
   My output:
   
   Sample PowerPoint File
   St. Cloud Technical College
   
   
   This is a Sample Slide
   Here is an outline of bulleted points
   You can print out PPT files as handouts using the 
   PRINT > 
 PRINT WHAT > HANDOUTS option
   
   
   image2.jpeg
   
   image1.jpeg






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864420#comment-17864420
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2218995629

   Thank you. That worked, but after working through a few other hiccups I have 
bumped into a new issue.
   I am trying to parse a ppt file.
   
   ```
   import org.apache.tika.io.TikaInputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.parser.Parser;
   import org.apache.tika.sax.BodyContentHandler;
   import org.apache.tika.sax.OfflineContentHandler;
   import org.apache.tika.parser.ocr.TesseractOCRConfig;
   
   TesseractOCRConfig config = new TesseractOCRConfig();
   config.setSkipOcr(true);
   ParseContext context = new ParseContext();
   context.set(TesseractOCRConfig.class, config);
   
   Parser parser = new AutoDetectParser();
   Metadata metadata = new Metadata();
   OfflineContentHandler handler = new OfflineContentHandler(new BodyContentHandler(writer));
   
   // Note: here we have to use TikaInputStream.get, otherwise certain content types
   // (e.g. 2007 pptx) might not be correctly detected by the parser
   try (InputStream original = TikaInputStream.get(input, metadata)) {
       parser.parse(original, handler, metadata, context);
       // ==> The call above crashes with:
       // Execution error (NoSuchMethodError) at org.apache.poi.util.IOUtils/toByteArray (IOUtils.java:241).
       // 'org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream$Builder org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.builder()'
   }
   ```
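
A `NoSuchMethodError` like the one above usually means that an older copy of a library is being loaded than the one the caller was compiled against. The following is a minimal diagnostic sketch, not part of Tika or POI: the class name `ClasspathProbe` and its helper methods are hypothetical, and it only uses JDK reflection. It prints which jar a class was loaded from and whether that class has a static no-argument method such as `builder()`.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.security.CodeSource;

public class ClasspathProbe {

    /** Returns the jar or directory a class was loaded from, or a marker for JDK classes. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap/unknown)" : source.getLocation().toString();
    }

    /** True if the class has a public static no-argument method with this name. */
    public static boolean hasStaticNoArgMethod(Class<?> type, String methodName) {
        try {
            Method method = type.getMethod(methodName);
            return Modifier.isStatic(method.getModifiers());
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the error message; pass another name as the first argument.
        String className = args.length > 0 ? args[0]
                : "org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream";
        try {
            Class<?> type = Class.forName(className);
            System.out.println("loaded from:   " + locationOf(type));
            System.out.println("has builder(): " + hasStaticNoArgMethod(type, "builder"));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

Run it with the same classpath as the failing application. If the printed jar is an older commons-io than expected, some other dependency is probably pulling it in.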






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863998#comment-17863998
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2216338193

   Try `TikaCoreProperties` instead.






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863981#comment-17863981
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2216056172

   I had to revert to `3.0.0-BETA` instead of `3.0.0-SNAPSHOT` due to 
dependencies in our code.
   
   I run into this issue when I use the `BETA` version.
   
   ```
   
   [2024-07-08T22:29:48Z] Execution error (ClassNotFoundException) at java.net.URLClassLoader/findClass (URLClassLoader.java:476).
   [2024-07-08T22:29:48Z] org.apache.tika.metadata.TikaMetadataKeys
   
   ```
   
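   I am using:
   ```
   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-core</artifactId>
     <version>3.0.0-BETA</version>
   </dependency>
   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parsers-standard-package</artifactId>
     <version>3.0.0-BETA</version>
   </dependency>
   ```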
   






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863932#comment-17863932
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2215277124

   Thank you @tballison. This worked. However, much of our code depends on the 
version being an actual release.
   Any idea when the new Tika version will be released? We would like to plan our 
upgrade to a Tika release that is compatible with PDFBox 3.x.






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863863#comment-17863863
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2214741939

   It's not in maven central. Add this to your pom.xml
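   ```
   <repositories>
       <repository>
           <id>id1</id>
           <url>https://repository.apache.org/snapshots/</url>
       </repository>
       <repository>
           <id>id2</id>
           <url>https://repository.apache.org/content/repositories/snapshots</url>
       </repository>
   </repositories>
   ```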






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863858#comment-17863858
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2214690992

   But which repo should I point to in the `.pom` file?
   
   I tried using 3.0.0-SNAPSHOT or 3.0.0 in the `.pom` file, but neither can be found.
   
   ```
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in central (https://repo1.maven.org/maven2/)
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in clojars (https://repo.clojars.org/)
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in sonatype-oss-public (https://oss.sonatype.org/content/groups/public/)
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in maven.alfresco.com (https://maven.alfresco.com/nexus/content/groups/public/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in central (https://repo1.maven.org/maven2/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in clojars (https://repo.clojars.org/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in sonatype-oss-public (https://oss.sonatype.org/content/groups/public/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in maven.alfresco.com (https://maven.alfresco.com/nexus/content/groups/public/)
   ```






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863660#comment-17863660
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2212942831

   There have been plans to do another alpha soon. Snapshots are here:
   
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/3.0.0-SNAPSHOT/






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863626#comment-17863626
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2212556752

   Thank you @THausherr.
   These errors occur for a bad PDF file. Even though these errors are being 
ignored and repaired, we are able to process it fine now.
   
   Any idea when this Tika version will be released?
   
   Until it's released, how can I pull from snapshots? 
https://github.com/apache/tika/pull/1473#issuecomment-2207307712






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863430#comment-17863430
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2211609822

   There has been a complaint about the NISC18030.ttf font in the past: 
PDFBOX-5743, and I can see in the browser that I searched for it and found it 
at https://github.com/justrajdeep/fonts/blob/master/NISC18030.ttf , and another 
complaint at https://github.com/XQuartz/XQuartz/issues/304 . I downloaded it 
again and there is indeed no "head" table. (And no glyf table either). It does 
have a name table and the font is "GB18030 Bitmap". The failure (i.e. that the 
entire scan fails) has been fixed some time ago, definitively in 3.0.2.
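
To check a font file for a missing `head` table independently of PDFBox, the sfnt table directory at the start of a TrueType/OpenType file can be read directly. This is a minimal sketch under stated assumptions: the class `SfntTableLister` is hypothetical, the file is assumed to be a single-font sfnt file (not a `.ttc` collection), and only the table tags are read, not the table contents.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class SfntTableLister {

    /** Reads the sfnt table directory and returns the four-character table tags in file order. */
    public static Set<String> readTableTags(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        data.readInt();                            // sfnt version
        int numTables = data.readUnsignedShort();  // number of table records that follow
        data.readFully(new byte[6]);               // searchRange, entrySelector, rangeShift
        Set<String> tags = new LinkedHashSet<>();
        byte[] tag = new byte[4];
        byte[] rest = new byte[12];                // checksum, offset, length of each record
        for (int i = 0; i < numTables; i++) {
            data.readFully(tag);
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            data.readFully(rest);
        }
        return tags;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SfntTableLister <font file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Set<String> tags = readTableTags(in);
            System.out.println("tables: " + tags);
            System.out.println("has 'head' table: " + tags.contains("head"));
        }
    }
}
```

A font for which this reports no `head` table is the kind of file that triggers the "'head' table is mandatory" message during font scanning.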






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863407#comment-17863407
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2211353100

   Yeah, I see. There are more of these. Since an `IOException` is thrown, we are 
failing. Is there anything I can do to avoid these errors? Why are they 
occurring?
   
   
   ```
   23:32:53.243 WARN i[Content] [main] j[6685831e5864fc77b5140f4a] o.a.p.p.f.FileSystemFontProvider 368 new fonts found, font cache will be re-built
   23:32:53.243 WARN i[Content] [main] j[6685831e5864fc77b5140f4a] o.a.p.p.f.FileSystemFontProvider Building on-disk font cache, this may take a while
   23:32:54.101 WARN i[Content] [main] j[6685831e5864fc77b5140f4a] o.a.p.p.f.FileSystemFontProvider Could not load font file: /System/Library/Fonts/LastResort.otf
   ```
   
   ```
   23:32:55.670 WARN i[Content] [main] j[6685831e5864fc77b5140f4a] o.a.p.p.f.FileSystemFontProvider Could not load font file: /System/Library/Fonts/Supplemental/NISC18030.ttf
   java.io.IOException: 'head' table is mandatory
   ```






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862912#comment-17862912
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

THausherr commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207940611

   Please look at the rest of the log output. IIRC this is a problem with 
`lastresortfont.otf` when the initial scanning is done. But that font is 
skipped and life continues.






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862894#comment-17862894
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207526085

   I was able to use tika `3.0.0-BETA` and the `pdfbox` is at `3.0.2`.
   
   Seeing this issue - any ideas? am I missing anything?
   
   ```
   java.io.IOException: Invalid character code 0xD800
       at org.apache.fontbox.ttf.CmapSubtable.processSubtype13(CmapSubtable.java:320)
       at org.apache.fontbox.ttf.CmapSubtable.initSubtable(CmapSubtable.java:113)
       at org.apache.fontbox.ttf.CmapTable.read(CmapTable.java:87)
       at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:365)
       at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:165)
       at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:144)
       at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:56)
       at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
       at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)
       at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:50)
       at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.addTrueTypeFont(FileSystemFontProvider.java:684)
       at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.scanFonts(FileSystemFontProvider.java:390)
       at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.<init>(FileSystemFontProvider.java:365)
       at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.<clinit>(FontMapperImpl.java:139)
       at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider(FontMapperImpl.java:158)
       at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:410)
       at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getTrueTypeFont(FontMapperImpl.java:318)
       at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:142)
       at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:153)
       at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
       at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:893)
       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:531)
       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:506)
       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
       at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:362)
       at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
       at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1369)
       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:235)
       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:215)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
       at com.xxx.content.Tika.extract(Tika.java:49)
       at com.xxx.content.Tika.extractText(Tika.java:33)
   ```
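
The value in the message, `0xD800`, is the first UTF-16 high surrogate, and surrogates are not valid Unicode scalar values. One plausible reading of the trace is therefore that a system font's cmap table maps a surrogate code and the font parser rejects it. The sketch below only illustrates that range check with JDK constants; the class `CodePointCheck` is hypothetical and this is not FontBox's actual code.

```java
public class CodePointCheck {

    /** True if the value is a Unicode scalar value: inside the code space and not a surrogate. */
    public static boolean isScalarValue(int code) {
        return Character.isValidCodePoint(code)
                && (code < Character.MIN_SURROGATE || code > Character.MAX_SURROGATE);
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xD800, 0xDFFF, 0xE000, 0x1F600, 0x110000};
        for (int code : samples) {
            System.out.printf("0x%X scalar value: %b%n", code, isScalarValue(code));
        }
    }
}
```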






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862888#comment-17862888
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207307712

   I tried using `3.0.0-SNAPSHOT` in `.pom` file but can't find it.
   
   ```
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in central (https://repo1.maven.org/maven2/)
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in clojars (https://repo.clojars.org/)
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in sonatype-oss-public (https://oss.sonatype.org/content/groups/public/)
   Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in maven.alfresco.com (https://maven.alfresco.com/nexus/content/groups/public/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in central (https://repo1.maven.org/maven2/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in clojars (https://repo.clojars.org/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in sonatype-oss-public (https://oss.sonatype.org/content/groups/public/)
   Could not find artifact org.apache.tika:tika-parsers-standard-package:jar:3.0.0 in maven.alfresco.com (https://maven.alfresco.com/nexus/content/groups/public/)
   ```






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862882#comment-17862882
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207177307

   If you can pull from the Apache snapshots repo, you can grab it from there?
   
   
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-parsers-standard-package/3.0.0-SNAPSHOT/
   
   I think we're at a good time to cut a BETA2 release either on Friday or
   next week. I've just pinged the dev list to get feedback on that.
   
   On Wed, Jul 3, 2024 at 4:15 PM Kiran Bachu ***@***.***> wrote:
   
   > Sorry, I cannot use 3.0.0-BETA, I can only use 2.9.2.
   >
   > <dependency>
   >   <groupId>org.apache.tika</groupId>
   >   <artifactId>tika-parsers-standard-package</artifactId>
   >   <version>2.9.2</version>
   > </dependency>
   >
   > Is there a way to use latest beta version for testing?
   






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862881#comment-17862881
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207164473

   Sorry, I cannot use 3.0.0-BETA, I can only use 2.9.2.
   ```
   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parsers-standard-package</artifactId>
     <version>2.9.2</version>
   </dependency>
   ```
   Is there a way to use latest beta version for testing?






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862880#comment-17862880
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207132252

   Y, it should have been in 3.0.0-BETA.  How are you using it?
   






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862879#comment-17862879
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

kbachuHighSpot commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2207113861

   This is great. Thanks for working on this.
   Is this released with the 3.0.0 that's in beta? I ask because it's listed here: 
https://tika.apache.org/3.0.0-BETA/index.html
   
   We are blocked on upgrading `pdfbox` to `3.0.2` because we also use `tika 
1.24`, which is not compatible with `pdfbox 3.0.2`.
   
   I just tried upgrading `tika` to 3.0.0 (beta), and I am still getting the error.
   ```
   Execution error (NoSuchMethodError) at 
org.apache.tika.parser.pdf.PDFParser/getPDDocument (PDFParser.java:506).
   'org.apache.pdfbox.pdmodel.PDDocument 
org.apache.pdfbox.pdmodel.PDDocument.load(java.io.File, java.lang.String, 
org.apache.pdfbox.io.MemoryUsageSetting)'
   ```
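
The missing method in this error, `PDDocument.load(...)`, belongs to the PDFBox 2.x API; PDFBox 3.x loads documents through `Loader.loadPDF(...)` instead. That suggests a Tika PDF parser built against PDFBox 2.x is still on the classpath next to PDFBox 3.x. The sketch below is one way to look for such a mix, assuming jars follow the usual `artifact-version.jar` naming; the class `ClasspathVersions` and its method are hypothetical, and only the JDK is used.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClasspathVersions {

    // Matches jar file names such as "pdfbox-3.0.2.jar" or "tika-core-3.0.0-BETA.jar".
    private static final Pattern JAR_NAME = Pattern.compile("(.+?)-(\\d.*)\\.jar");

    /** Returns "artifact version" strings for classpath jars whose artifact name starts with the prefix. */
    public static List<String> findArtifacts(String classpath, String separator, String prefix) {
        List<String> found = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            // Strip the directory part, accepting both Unix and Windows separators.
            int slash = Math.max(entry.lastIndexOf('/'), entry.lastIndexOf('\\'));
            Matcher matcher = JAR_NAME.matcher(entry.substring(slash + 1));
            if (matcher.matches() && matcher.group(1).startsWith(prefix)) {
                found.add(matcher.group(1) + " " + matcher.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        for (String prefix : new String[] {"pdfbox", "fontbox", "tika"}) {
            System.out.println(prefix + ": " + findArtifacts(classpath, File.pathSeparator, prefix));
        }
    }
}
```

If the output lists a 1.x or 2.x Tika parser jar together with a 3.x PDFBox jar, the two are not expected to work together.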






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844391#comment-17844391
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2098906252

   I just asked on our dev list. I'd like to get 3.x out soon. We need a beta2 
release, though, I think.






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844325#comment-17844325
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

dsvensson commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2098585131

   @tballison Will this be backported to Tika 2.x, or if not, how far off is 
Tika 3.x?






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844324#comment-17844324
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

danielstravito commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2098582675

   @tballison Will this be backported to Tika 2.x, or if not, how far off is 
Tika 3.x?






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-12-01 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792128#comment-17792128
 ] 

Hudson commented on TIKA-3347:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1410 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1410/])
TIKA-3347 -- upgrade to PDFBox 3.x (#1473) (github: [https://github.com/apache/tika/commit/78b897bdb28ba478cc470896ffb2ef9baf6bff8c])
* (edit) tika-core/src/main/java/org/apache/tika/metadata/AccessPermissions.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/crypto/TSDParserTest.java
* (edit) tika-fuzzing/src/main/java/org/apache/tika/fuzzing/pdf/PDFTransformer.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/VectorGraphicsOnlyPDFRenderer.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFIncrementalUpdatesTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-font-module/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/renderer/pdf/pdfbox/PDFBoxRenderer.java
* (edit) tika-fuzzing/src/main/java/org/apache/tika/fuzzing/pdf/EvilCOSWriter.java
* (edit) tika-parent/pom.xml
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/test/java/org/apache/tika/parser/indesign/IDMLParserTest.java




[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-12-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792109#comment-17792109
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison merged PR #1473:
URL: https://github.com/apache/tika/pull/1473






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790515#comment-17790515
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-1829662977

   Until PDFBox 3.0.1 is released, this requires a local build of PDFBox. I'll 
move this out of draft stage once PDFBox is released.






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790514#comment-17790514
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison opened a new pull request, #1473:
URL: https://github.com/apache/tika/pull/1473

   
   
   






[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-11-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790509#comment-17790509
 ] 

Tim Allison commented on TIKA-3347:
---

I just merged main into TIKA-3347 with PDFBox 3.0.1's rc1. Once the vote has 
passed for 3.0.1, let's work towards a Tika 3.0.0-BETA release?



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-09-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765202#comment-17765202
 ] 

Tim Allison commented on TIKA-3347:
---

Sorry for my delay, I've posted the extracts generated from both runs here:

[https://corpora.tika.apache.org/base/share/example_extracts/]

 

I'll try to figure out the diff.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-09-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764461#comment-17764461
 ] 

Tilman Hausherr commented on TIKA-3347:
---

I ran PDFBox extractText on the file... the extractions are identical with 
both. Another weird thing is that "федерации" occurs 1020 times, but tika 
reports 1015 times in the A version (TOP_N_TOKENS_A). A manual run with a tika 
build from a few days ago produces a file that also has 1020 occurrences.
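
For anyone who wants to double-check such a count independently, here is a 
minimal sketch. The class name `TokenCount` and the tokenization rule (runs of 
Unicode letters and digits, compared case-insensitively) are assumptions made 
for illustration; they are not necessarily what tika-eval does, so small 
differences from the reported numbers would not be surprising.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Tika or PDFBox code.
final class TokenCount {
    // Assumed tokenization: maximal runs of Unicode letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Counts how often `token` appears as a whole token in `text`, ignoring case.
    static int count(String text, String token) {
        String wanted = token.toLowerCase(Locale.ROOT);
        int n = 0;
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group().toLowerCase(Locale.ROOT).equals(wanted)) {
                n++;
            }
        }
        return n;
    }
}
```

Running `TokenCount.count(extractedText, "федерации")` on both extracts would 
show whether the difference comes from the text itself or from the counting.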



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-09-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764275#comment-17764275
 ] 

Tilman Hausherr commented on TIKA-3347:
---

3 files:
* bug_trackers/poppler/poppler-58785-0.zip-7.pdf: isn't good with Adobe, so it doesn't matter
* commoncrawl3_refetched/HQ/HQXZGM6CGDEGMIWX5PDFEGN7MLPYWROP: might be a real difference, needs more investigation
* govdocs1/372/372582.pdf: it's true that 3.0 is losing a bit, but the file is also a mess with 2.0.29



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764261#comment-17764261
 ] 

Tim Allison commented on TIKA-3347:
---

Latest reports are here: 
https://corpora.tika.apache.org/base/reports/pdfbox-3.x-20230912.tgz

I haven't had a chance to look at all. :(



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-09-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762117#comment-17762117
 ] 

Tim Allison commented on TIKA-3347:
---

https://github.com/apache/camel-quarkus/issues/5234



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-08-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760087#comment-17760087
 ] 

Tim Allison commented on TIKA-3347:
---

Yes. That’s probably only reproducible multithreaded? I’ll see what I can do…
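
One way to probe for a failure that only shows up under concurrency is to run 
the same task many times on a thread pool and collect whatever it throws. The 
sketch below is only an illustration under stated assumptions: `ConcurrentRepro` 
and `hammer` are hypothetical names, and the task passed in (for example, 
parsing the suspect file) is left to the caller. It is not how the regression 
runs are actually driven.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stress harness for illustration; not Tika or PDFBox code.
final class ConcurrentRepro {

    // Submits `task` `runs` times to a pool of `threads` threads and
    // returns the exceptions thrown by the runs that failed.
    static <T> List<Throwable> hammer(Callable<T> task, int threads, int runs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < runs; i++) {
                futures.add(pool.submit(task));
            }
            List<Throwable> failures = new ArrayList<>();
            for (Future<T> future : futures) {
                try {
                    future.get();
                } catch (ExecutionException e) {
                    // The cause is the exception the task itself threw.
                    failures.add(e.getCause());
                }
            }
            return failures;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

An empty result does not prove the absence of a race; it only means this 
particular run did not hit it.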



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-08-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760037#comment-17760037
 ] 

Tilman Hausherr commented on TIKA-3347:
---

The IllegalArgumentException was fixed in PDFBOX-5652. So it was two issues. 
The other issue was fixed this morning, i.e. not in time. I couldn't reproduce 
the ConcurrentModificationException by using the -z option, or by opening the 
file with PDFBox. But maybe I didn't really try very hard.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-08-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760030#comment-17760030
 ] 

Tim Allison commented on TIKA-3347:
---

From your reports, it looks like there's a new (very rare) 
IllegalArgumentException, and this may be a good thing!

{noformat}
java.lang.IllegalArgumentException
	at org.apache.fontbox.ttf.gsub.GlyphSubstitutionDataExtractor.extractDataFromSingleSubstTableFormat2Table(GlyphSubstitutionDataExtractor.java:223)
	at org.apache.fontbox.ttf.gsub.GlyphSubstitutionDataExtractor.extractData(GlyphSubstitutionDataExtractor.java:183)
	at org.apache.fontbox.ttf.gsub.GlyphSubstitutionDataExtractor.populateGsubData(GlyphSubstitutionDataExtractor.java:153)
	at org.apache.fontbox.ttf.gsub.GlyphSubstitutionDataExtractor.populateGsubData(GlyphSubstitutionDataExtractor.java:139)
	at org.apache.fontbox.ttf.gsub.GlyphSubstitutionDataExtractor.buildMapBackedGsubData(GlyphSubstitutionDataExtractor.java:102)
	at org.apache.fontbox.ttf.gsub.GlyphSubstitutionDataExtractor.getGsubData(GlyphSubstitutionDataExtractor.java:70)
	at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:107)
	at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:365)
	at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:165)
	at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:144)
	at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)
{noformat}

The other one is more troubling, though.  It only occurs once, but how does it 
happen?



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-08-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760023#comment-17760023
 ] 

Tim Allison commented on TIKA-3347:
---

Oh, wow.  Great!  I just got a clean build and pushed what I had to the issue 
branch.

Your notes were super helpful!



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-08-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760014#comment-17760014
 ] 

Tilman Hausherr commented on TIKA-3347:
---

I did: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.29_vs_3.0.0.tar.xz

I can't find my own postings about it, but it resulted in only one new issue, 
PDFBOX-5649. It's possible that there were other ones that I didn't bother 
about.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2023-08-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759961#comment-17759961
 ] 

Tim Allison commented on TIKA-3347:
---

PDFBox 3.0.0 was just released. I'm going to wipe out the earlier TIKA-3347 
branch and start a new branch with the same name with the latest Tika and the 
actual PDFBox 3.0.0 release.  I'm looking forward to the regression tests.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315796#comment-17315796
 ] 

Tilman Hausherr commented on TIKA-3347:
---

No, I don't think so. I did that angle thing as an example both in 2.0 and 3.0 
at the same time in PDFBOX-4371 and it uses the same strategy.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315786#comment-17315786
 ] 

Tim Allison commented on TIKA-3347:
---

Will this be problematic in 3.x? 
https://github.com/apache/tika/blob/6114fac007291b3f80e202f0f4b17411bbb6d12d/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java#L276



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315785#comment-17315785
 ] 

Tim Allison commented on TIKA-3347:
---

Interesting... I don't think we're writing to those files, but I'll check. 
Thank you!



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315737#comment-17315737
 ] 

Tilman Hausherr commented on TIKA-3347:
---

I don't have the code here, but maybe this might help re (1): make sure not to 
read from a PDF file and then write to it. Write to a different file. This is 
because of the "on demand" logic that writing might destroy what you are still 
reading, see PDFBOX-5086.
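
A minimal sketch of that precaution, using only the JDK: refuse to run when the 
output path is the same file as the input. `SeparateOutput` and `Transformer` 
are hypothetical names; in real code the transformer would be whatever loads 
the PDF from the source and saves the result to the target.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard for illustration; not Tika or PDFBox code.
final class SeparateOutput {

    // Stand-in for "read the document from source, write the result to target".
    interface Transformer {
        void apply(Path source, Path target) throws IOException;
    }

    // Runs the transformer only if target is not the file being read.
    static void transform(Path source, Path target, Transformer transformer)
            throws IOException {
        if (Files.exists(target) && Files.isSameFile(source, target)) {
            throw new IllegalArgumentException(
                    "target must differ from source: " + target);
        }
        transformer.apply(source, target);
    }
}
```

`Files.isSameFile` also catches the case where two different-looking paths 
point at the same file, which a plain string comparison would miss.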




[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315639#comment-17315639
 ] 

Tim Allison commented on TIKA-3347:
---

Fixed {{PDFMarkedContent2XHTML}}...still need to work on 2 and figure out if 
the files in 1 are genuinely corrupt.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315594#comment-17315594
 ] 

Tim Allison commented on TIKA-3347:
---

Sorry, I misspoke above about the incremental parser... I meant the new "on 
demand" parsing.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315582#comment-17315582
 ] 

Tim Allison commented on TIKA-3347:
---

Thank you, [~tilman]!  

Most of the tests now pass.  I started a TIKA-3347 branch.

1) The access checker tests in PDFParserTest don't pass because the parser now 
has problems with the four files (flate compression exception)...the relevant 
tests are ignored (@Ignore("failing in 3.x")).

2) The PreflightParser needs help.  I need to figure out what changed.  My 
hasty attempts didn't work...code compiles, but I'm sure I didn't do the right 
thing.

3) The marked content tests aren't working.  My fixes in the 
PDFMarkedContent2XHTML were not correct...I'll look into these.



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315210#comment-17315210
 ] 

Tilman Hausherr commented on TIKA-3347:
---

No performance benefit, this is just a structure redesign. But yes the code 
should be adjusted.

To get the unicode, call font.toUnicode(code).
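
As a rough sketch of how that could feed the missing-mapping count: look up the 
Unicode for each character code and tally the codes that yield nothing. To keep 
the example free of PDFBox dependencies, the lookup is passed in as a function 
standing in for font.toUnicode(code); `UnmappedGlyphCounter` is a hypothetical 
name, and treating a null or empty result as "no mapping" is an assumption.

```java
import java.util.function.IntFunction;

// Hypothetical helper for illustration; not Tika or PDFBox code.
final class UnmappedGlyphCounter {
    // Stand-in for the font.toUnicode(code) lookup; assumed to return
    // null (or an empty string) when the code has no Unicode mapping.
    private final IntFunction<String> toUnicode;
    private int totalGlyphs;
    private int unmappedGlyphs;

    UnmappedGlyphCounter(IntFunction<String> toUnicode) {
        this.toUnicode = toUnicode;
    }

    // Intended to be called once per glyph with its character code.
    void record(int code) {
        totalGlyphs++;
        String unicode = toUnicode.apply(code);
        if (unicode == null || unicode.isEmpty()) {
            unmappedGlyphs++;
        }
    }

    int totalGlyphs() {
        return totalGlyphs;
    }

    int unmappedGlyphs() {
        return unmappedGlyphs;
    }
}
```

In a text-extraction subclass, `record` would be called from the glyph 
callback with the code it receives, with the lookup bound to the current font.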



[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2021-04-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315187#comment-17315187
 ] 

Tim Allison commented on TIKA-3347:
---

[~tilman]...please let me know if I should ask over on user@pdfbox...

Will we get the performance benefits of the incremental parser by swapping in 
{{Loader.loadPDF(...)}} for {{PDDocument.load()}}?

Any recs for counting missing unicode char mappings here: 
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L1041

It doesn't look like the unicode string is part of {{showGlyph(...)}} any more?
