Re: Enabling community around tika-helm

2024-11-10 Thread Tim Allison
Thank you, Lewis!

Add committers?

On Sat, Nov 9, 2024 at 4:08 PM lewis john mcgibbney 
wrote:

> Hi dev@,
>
> Over the last number of years, tika-helm [0] has been doing pretty well;
> however, I see a problem.
>
> I had an issue (now fixed) and was not seeing any email activity, so
> incoming contributions were essentially ignored. I am addressing that this
> weekend and will push a release. However, I want to ask the dev@ community
> for input on enabling community building, which would remove me (lewismc) as
> a bottleneck/single point of failure.
>
> The contributions I refer to above did not come from existing Tika
> Committers.
>
> Does anyone have suggestions?
>
> Thank you
> lewismc
>
> [0] https://github.com/apache/tika-helm
>
>
> Lewis J. McGibbney Ph.D
>


[jira] [Updated] (TIKA-4345) Allow body-only content extraction for msg and other email formats

2024-11-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4345:
--
Description: 
At least in the OutlookExtractor, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
that need to be modified. We're not writing the from/to etc in the RFC822Parser 
into the content stream.

I propose that this be a non-breaking/opt-in option in 3.x, and then the 
default in 4.x.

In thinking about this more, I think we should get rid of injection of the 
header info into the content in msg files in 4.x. If users want it, we can add 
it back and do it correctly -- in .eml, outlook and pst. It is weird that we 
currently have it only in msg.

So, for 3.x, I propose that we allow users to turn this off in msg files. For 
4.x, we just won't do it...unless someone opens a ticket.

Let me know what you think/if there are any objections.



  was:
At least in the OutlookExtractor, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
that need to be modified. We're not writing the from/to etc in the RFC822Parser 
into the content stream.

I propose that this be a non-breaking/opt-in option in 3.x, and then the 
default in 4.x.


> Allow body-only content extraction for msg and other email formats
> --
>
> Key: TIKA-4345
> URL: https://issues.apache.org/jira/browse/TIKA-4345
> Project: Tika
>      Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the 
> content stream. For some use cases, it would be helpful to extract only the 
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
> that need to be modified. We're not writing the from/to etc in the 
> RFC822Parser into the content stream.
> I propose that this be a non-breaking/opt-in option in 3.x, and then the 
> default in 4.x.
> In thinking about this more, I think we should get rid of injection of the 
> header info into the content in msg files in 4.x. If users want it, we can 
> add it back and do it correctly -- in .eml, outlook and pst. It is weird that 
> we currently have it only in msg.
> So, for 3.x, I propose that we allow users to turn this off in msg files. For 
> 4.x, we just won't do it...unless someone opens a ticket.
> Let me know what you think/if there are any objections.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
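
The opt-in idea in TIKA-4345 can be sketched roughly as follows. This is only an illustration of the proposed behavior, not Tika code: `BodyOnlySketch`, `MsgOptions`, `writeHeadersToContent` and `renderContent` are hypothetical names, and the real option (if any) may be named and wired up quite differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from Tika.
class BodyOnlySketch {

    /** Hypothetical opt-in switch; the default keeps the current 3.x behavior. */
    static final class MsgOptions {
        final boolean writeHeadersToContent;

        MsgOptions(boolean writeHeadersToContent) {
            this.writeHeadersToContent = writeHeadersToContent;
        }

        // 3.x proposal: headers are still written unless the user opts out.
        static MsgOptions legacyDefault() {
            return new MsgOptions(true);
        }

        // What a user would opt into in 3.x (and the proposed default in 4.x).
        static MsgOptions bodyOnly() {
            return new MsgOptions(false);
        }
    }

    /** Builds the text that would go into the content stream. */
    static String renderContent(Map<String, String> headers, String body, MsgOptions options) {
        StringBuilder sb = new StringBuilder();
        if (options.writeHeadersToContent) {
            for (Map.Entry<String, String> e : headers.entrySet()) {
                sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
            }
        }
        sb.append(body);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("From", "alice@example.com");
        headers.put("To", "bob@example.com");
        System.out.println(renderContent(headers, "Hello", MsgOptions.legacyDefault()));
        System.out.println(renderContent(headers, "Hello", MsgOptions.bodyOnly()));
    }
}
```

In either mode the header values would presumably still be available as metadata; only the content stream changes.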


[jira] [Updated] (TIKA-4345) Allow body-only content extraction for msg and other email formats

2024-11-07 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4345:
--
Description: 
At least in the OutlookExtractor, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
that need to be modified. We're not writing the from/to etc in the RFC822Parser 
into the content stream.

I propose that this be a non-breaking/opt-in option in 3.x, and then the 
default in 4.x.

  was:
At least in the OutlookExtractor, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
that need to be modified. We're not writing the from/to etc in the RFC822Parser 
into the content stream.

I propose that this be a non-breaking/opt-in option.


> Allow body-only content extraction for msg and other email formats
> --
>
> Key: TIKA-4345
> URL: https://issues.apache.org/jira/browse/TIKA-4345
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the 
> content stream. For some use cases, it would be helpful to extract only the 
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
> that need to be modified. We're not writing the from/to etc in the 
> RFC822Parser into the content stream.
> I propose that this be a non-breaking/opt-in option in 3.x, and then the 
> default in 4.x.





[jira] [Created] (TIKA-4345) Allow body-only content extraction for msg and other email formats

2024-11-07 Thread Tim Allison (Jira)
Tim Allison created TIKA-4345:
-

 Summary: Allow body-only content extraction for msg and other 
email formats
 Key: TIKA-4345
 URL: https://issues.apache.org/jira/browse/TIKA-4345
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


At least in the OutlookParser, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

I propose that this be a non-breaking/opt-in option.





[jira] [Updated] (TIKA-4345) Allow body-only content extraction for msg and other email formats

2024-11-07 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4345:
--
Description: 
At least in the OutlookExtractor, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
that need to be modified. We're not writing the from/to etc in the RFC822Parser 
into the content stream.

I propose that this be a non-breaking/opt-in option.

  was:
At least in the OutlookParser, we're writing some of the headers into the 
content stream. For some use cases, it would be helpful to extract only the 
body content into the content stream.

I propose that this be a non-breaking/opt-in option.


> Allow body-only content extraction for msg and other email formats
> --
>
> Key: TIKA-4345
> URL: https://issues.apache.org/jira/browse/TIKA-4345
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the 
> content stream. For some use cases, it would be helpful to extract only the 
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
> that need to be modified. We're not writing the from/to etc in the 
> RFC822Parser into the content stream.
> I propose that this be a non-breaking/opt-in option.





[jira] [Resolved] (TIKA-4344) Add wrapper for magika detector

2024-11-05 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4344.
---
Fix Version/s: 4.0.0
   3.1.0
   Resolution: Fixed

> Add wrapper for magika detector
> ---
>
> Key: TIKA-4344
> URL: https://issues.apache.org/jira/browse/TIKA-4344
> Project: Tika
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 4.0.0, 3.1.0
>
>
> https://github.com/google/magika
> See also: https://www.youtube.com/watch?v=PBbld8xB2Bo





[jira] [Created] (TIKA-4344) Add wrapper for magika detector

2024-11-05 Thread Tim Allison (Jira)
Tim Allison created TIKA-4344:
-

 Summary: Add wrapper for magika detector
 Key: TIKA-4344
 URL: https://issues.apache.org/jira/browse/TIKA-4344
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


https://github.com/google/magika

See also: https://www.youtube.com/watch?v=PBbld8xB2Bo





[jira] [Resolved] (TIKA-4337) Improvements to recent xps mods

2024-11-01 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4337.
---
Fix Version/s: 4.0.0
   3.1.0
   Resolution: Fixed

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 4.0.0, 3.1.0
>
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz
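
The number format exceptions mentioned above are the kind that arise when a value such as `"3,4"` is passed straight to `Integer.parseInt`. The following is a minimal sketch of a tolerant alternative, assuming the value is a comma-delimited list of integers. Which xps attribute was involved and how the parser was actually fixed are not stated in this thread, so `CommaDelimitedInts` and `parse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
class CommaDelimitedInts {

    /** Parses "1,2,3" into [1, 2, 3], skipping blank or non-numeric pieces. */
    static List<Integer> parse(String raw) {
        List<Integer> out = new ArrayList<>();
        if (raw == null) {
            return out;
        }
        for (String piece : raw.split(",")) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                out.add(Integer.parseInt(trimmed));
            } catch (NumberFormatException e) {
                // Skip pieces that are not integers instead of failing the whole parse.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        try {
            Integer.parseInt("3,4"); // the failure mode described in the issue
        } catch (NumberFormatException e) {
            System.out.println("parseInt failed: " + e.getMessage());
        }
        System.out.println(parse("3,4")); // prints [3, 4]
    }
}
```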





[jira] [Commented] (TIKA-4337) Improvements to recent xps mods

2024-10-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894629#comment-17894629
 ] 

Tim Allison commented on TIKA-4337:
---

Yes, I completely agree about the "opportunistic improvement." I think this could 
be an area for future work, but it is not applicable broadly.

The licenses for those files are definitely not Apache 2.0 compliant... so we 
can't include them directly in our unit tests. :( 

However, I could put them in our regression corpus, and we'd see changes 
whenever we run large scale regression testing before a release. This is not 
ideal, but is the best we can do.

Do any fellow devs ([~tilman] [~nick] ?) know if we could try to download the 
files as part of the build process and then incorporate local copies into unit 
tests? I know PDFBox downloads some files for unit tests, but I don't know what 
their licensing is... Or does this go against the spirit of the Apache 
license?

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Created] (TIKA-4343) Remove agepredictor in 4.x

2024-10-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4343:
-

 Summary: Remove agepredictor in 4.x
 Key: TIKA-4343
 URL: https://issues.apache.org/jira/browse/TIKA-4343
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[jira] [Resolved] (TIKA-4341) Fix deserialization of MetadataListFilter

2024-10-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4341.
---
Fix Version/s: 4.0.0
   3.1.0
   Resolution: Fixed

> Fix deserialization of MetadataListFilter
> -
>
> Key: TIKA-4341
> URL: https://issues.apache.org/jira/browse/TIKA-4341
> Project: Tika
>  Issue Type: Bug
>        Reporter: Tim Allison
>Priority: Trivial
> Fix For: 4.0.0, 3.1.0
>
>
> MetadataListFilter in its {{load(_)}} should expect children of type 
> MetadataListFilter, not MetadataFilter.





[jira] [Created] (TIKA-4342) Remove tika-batch from tika-eval's FileProfiler

2024-10-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4342:
-

 Summary: Remove tika-batch from tika-eval's FileProfiler
 Key: TIKA-4342
 URL: https://issues.apache.org/jira/browse/TIKA-4342
 Project: Tika
  Issue Type: Sub-task
Reporter: Tim Allison


FileProfiler is the simplest handler. Let's start there.





[jira] [Resolved] (TIKA-4340) Remove tika-batch from tika-app

2024-10-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4340.
---
Resolution: Fixed

> Remove tika-batch from tika-app
> ---
>
> Key: TIKA-4340
> URL: https://issues.apache.org/jira/browse/TIKA-4340
> Project: Tika
>  Issue Type: Sub-task
>        Reporter: Tim Allison
>Priority: Major
>
> Remove tika-batch option from tika-app and support translating basic 
> commandline args into a call to tika-pipes.





[jira] [Created] (TIKA-4341) Fix deserialization of MetadataListFilter

2024-10-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4341:
-

 Summary: Fix deserialization of MetadataListFilter
 Key: TIKA-4341
 URL: https://issues.apache.org/jira/browse/TIKA-4341
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison


MetadataListFilter in its {{load(_)}} should expect children of type 
MetadataListFilter, not MetadataFilter.





Apache Tika in the ASF project spotlight

2024-10-29 Thread Tim Allison
https://news.apache.org/foundation/entry/asf-project-spotlight-apache-tika


[jira] [Commented] (TIKA-4314) CompositeParser returns only one parser per content type

2024-10-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893842#comment-17893842
 ] 

Tim Allison commented on TIKA-4314:
---

Sorry for dropping the ball on this. The SupplementingParser is definitely the 
way to go with this. I looked at it this morning and we haven't wired up the 
serialization/configuration so that you can easily specify which component 
parsers go into that Parser.

If we did that, then we would be able to configure 
{{o.a.t.p.external2.ExternalParser}}s for each commandline you wanted.

If you don't need to configure this via xml, e.g. you're running Tika 
programmatically, this should not be too hard.

> CompositeParser returns only one parser per content type
> 
>
> Key: TIKA-4314
> URL: https://issues.apache.org/jira/browse/TIKA-4314
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.9.2
>Reporter: Leszek Sliwko
>Priority: Major
> Attachments: duration-test-2.avi, geolocation-test-1.jpg, 
> geolocation-test-2.jpg
>
>
> External parsers can have many supported content types, but information is 
> lost in CompositeParser:
>  
> public Map<MediaType, Parser> getParsers(ParseContext context) {
>     Map<MediaType, Parser> map = new HashMap<>();
>     for (Parser parser : parsers) {
>         for (MediaType type : parser.getSupportedTypes(context)) {
>             map.put(registry.normalize(type), parser);
>         }
>     }
>     return map;
> }
>  
> To recreate, parse any avi file (content type: video/x-msvideo). Only the 
> exiftool parser will be picked up, and the ffmpeg parser won't be executed.
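
The effect reported here can be reproduced without Tika. The sketch below uses plain strings as stand-ins for parsers and media types (so `LastParserWins` and `flatten` are hypothetical, not Tika classes) and shows that a single-valued map keeps only the last parser registered for a type.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the getParsers() loop quoted in the issue; not Tika code.
class LastParserWins {

    /** Maps each type to one parser name; later parsers overwrite earlier ones. */
    static Map<String, String> flatten(Map<String, List<String>> typesByParser) {
        Map<String, String> byType = new HashMap<>();
        for (Map.Entry<String, List<String>> e : typesByParser.entrySet()) {
            for (String type : e.getValue()) {
                byType.put(type, e.getKey()); // put() silently replaces any earlier entry
            }
        }
        return byType;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps registration order: ffmpeg first, exiftool second.
        Map<String, List<String>> typesByParser = new LinkedHashMap<>();
        typesByParser.put("ffmpeg", List.of("video/x-msvideo"));
        typesByParser.put("exiftool", List.of("video/x-msvideo", "image/jpeg"));

        // Only exiftool remains for video/x-msvideo.
        System.out.println(flatten(typesByParser).get("video/x-msvideo"));
    }
}
```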





[jira] [Commented] (TIKA-4314) CompositeParser returns only one parser per content type

2024-10-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893880#comment-17893880
 ] 

Tim Allison commented on TIKA-4314:
---

Got it. Thank you. I like what you've done. There are a few challenges with 
this route. 

The default legacy ExternalParser that is loaded by TikaConfig or by default is 
a CompositeParser. If we change the behavior of the CompositeParser, that will 
have unintended consequences on other combinations of parsers. The basic design 
in Tika is one parser per file type.

Another issue is that this relies on the legacy ExternalParser which wraps a 
number of external parsers. We're moving towards the more robust and flexible 
{{o.a.t.p.external2.ExternalParser}}. So, I don't think we want to have such a 
major change hinging on something that will be deprecated in 3.x and removed by 
4.x (maybe? depending on community discussions/feedback).

I think it would be much better to use the SupplementingParser, and have it 
wrap the ExternalParsers that you want.

If we head in this direction, what will it take to get this working for you? 
Are you able to configure your parsers programmatically, or are you using 
tika-server or something else where you need to configure the parsers via 
tika-config.xml?

> CompositeParser returns only one parser per content type
> 
>
> Key: TIKA-4314
> URL: https://issues.apache.org/jira/browse/TIKA-4314
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.9.2
>Reporter: Leszek Sliwko
>Priority: Major
> Attachments: CompositeParser.java, duration-test-2.avi, 
> geolocation-test-1.jpg, geolocation-test-2.jpg
>
>
> External parsers can have many supported content types, but information is 
> lost in CompositeParser:
>  
> public Map<MediaType, Parser> getParsers(ParseContext context) {
>     Map<MediaType, Parser> map = new HashMap<>();
>     for (Parser parser : parsers) {
>         for (MediaType type : parser.getSupportedTypes(context)) {
>             map.put(registry.normalize(type), parser);
>         }
>     }
>     return map;
> }
>  
> To recreate, parse any avi file (content type: video/x-msvideo). Only the 
> exiftool parser will be picked up, and the ffmpeg parser won't be executed.
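
The "supplementing" approach suggested in the comment above, where several parsers run on the same file and their metadata is combined, can be sketched in isolation as follows. This is not the API of Tika's SupplementingParser: `SupplementingSketch` and `extractAll` are hypothetical, extractors are modeled as plain functions, and the first-wins merge rule is an assumption chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of the idea; not Tika's SupplementingParser.
class SupplementingSketch {

    /** Runs every extractor on the same input; earlier extractors win on key clashes. */
    static Map<String, String> extractAll(
            String input, List<Function<String, Map<String, String>>> extractors) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Function<String, Map<String, String>> extractor : extractors) {
            for (Map.Entry<String, String> e : extractor.apply(input).entrySet()) {
                merged.putIfAbsent(e.getKey(), e.getValue()); // assumed first-wins policy
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-ins for two external tools that both handle the same file.
        Function<String, Map<String, String>> exiftool = in -> Map.of("duration", "10");
        Function<String, Map<String, String>> ffmpeg =
                in -> Map.of("duration", "12", "codec", "h264");

        Map<String, String> merged = extractAll("test.avi", List.of(exiftool, ffmpeg));
        System.out.println(merged.get("duration")); // value from the first extractor
        System.out.println(merged.get("codec"));    // supplemented by the second
    }
}
```

Unlike the single-valued map in CompositeParser, nothing is dropped here: every extractor runs, and only conflicting keys need a policy.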





[jira] [Created] (TIKA-4340) Remove tika-batch from tika-app

2024-10-29 Thread Tim Allison (Jira)
Tim Allison created TIKA-4340:
-

 Summary: Remove tika-batch from tika-app
 Key: TIKA-4340
 URL: https://issues.apache.org/jira/browse/TIKA-4340
 Project: Tika
  Issue Type: Sub-task
Reporter: Tim Allison


Remove tika-batch option from tika-app and support translating basic 
commandline args into a call to tika-pipes.





[jira] [Commented] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17

2024-10-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893625#comment-17893625
 ] 

Tim Allison commented on TIKA-4322:
---

I just updated the Jenkins main-jdk17 to push to the snapshot repo. 

> Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
> --
>
> Key: TIKA-4322
> URL: https://issues.apache.org/jira/browse/TIKA-4322
> Project: Tika
>  Issue Type: Task
>    Reporter: Tim Allison
>Priority: Major
>






[jira] [Commented] (TIKA-4337) Improvements to recent xps mods

2024-10-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893603#comment-17893603
 ] 

Tim Allison commented on TIKA-4337:
---

This is relevant for text extraction. I am NOT suggesting that you implement 
this, though.


It looks like xps also has structure/tag info, like PDF does, to try 
to group text pieces logically. If you look at: 
10b9b1c63da0c725f74256f22bbd4956a64b35cea3edc6ab6a43eeb7710888d6, there's a 
Structure directory under Document, and in the Fragments subdir, there are 
lists of which text runs should be in the same paragraph.


{code:xml}

  
   
  
  
   
   
  
  
   
   
  
{code}


> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Commented] (TIKA-4337) Improvements to recent xps mods

2024-10-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893594#comment-17893594
 ] 

Tim Allison commented on TIKA-4337:
---

Again, unrelated to the text extraction work -- this is just something... It 
looks like xps "stripes" images just like some PDF creators do/used to do. See: 
a28ab64ba223643c6a30d542deb543e2ea3acac911f04a7784c9d3d9f583df01

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Commented] (TIKA-4337) Improvements to recent xps mods

2024-10-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893589#comment-17893589
 ] 

Tim Allison commented on TIKA-4337:
---

[~ruairidh-next], this is totally unrelated to the text extraction code that 
you're working on, but if you come across attachments in any of these files, 
please let me know. I don't even know if xps allows it (it must?!), and I plan 
to do my own analysis on the recent batch of files.

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Commented] (TIKA-4337) Improvements to recent xps mods

2024-10-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893518#comment-17893518
 ] 

Tim Allison commented on TIKA-4337:
---

Fantastic. Thank you!

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>    Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Commented] (TIKA-4338) Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module

2024-10-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892818#comment-17892818
 ] 

Tim Allison commented on TIKA-4338:
---

Thank you for opening this issue [~sandeep_kulkarni].

> Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module
> --
>
> Key: TIKA-4338
> URL: https://issues.apache.org/jira/browse/TIKA-4338
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sandeep Kulkarni
>Priority: Major
> Fix For: 4.0.0, 3.1.0
>
>
> As per the release notes for Tika 3.0.0, TagSoup is mentioned as replaced 
> with JSoup. I had requested its removal earlier in TIKA-4109.
> So I integrated Tika 3.0.0 and found that TagSoup is still shown as one of 
> the dependencies of tika-parser-code-module. It seems to have been removed 
> only from tika-parser-html-module.
> So is it possible to completely get rid of TagSoup from Tika, as it is EOL? 
> tika-parser-code-module has a dependency on 
> *org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1.*





Next 3.x should be 3.1.0?

2024-10-25 Thread Tim Allison
All,
  In looking already at the mods to the 3.x branch, I think the next
release will be a minor revision, not a patch. In short, I think we
should go with 3.1.0 for the next 3.x release.
   Let me know if you disagree.

   Best,

   Tim


[jira] [Resolved] (TIKA-4338) Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module

2024-10-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4338.
---
Fix Version/s: 3.1.0
   4.0.0
 Assignee: (was: Tim Allison)
   Resolution: Fixed

> Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module
> --
>
> Key: TIKA-4338
> URL: https://issues.apache.org/jira/browse/TIKA-4338
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sandeep Kulkarni
>Priority: Major
> Fix For: 3.1.0, 4.0.0
>
>
> As per the release notes for Tika 3.0.0, TagSoup is mentioned as replaced 
> with JSoup. I had requested its removal earlier in TIKA-4109.
> So I integrated Tika 3.0.0 and found that TagSoup is still shown as one of 
> the dependencies of tika-parser-code-module. It seems to have been removed 
> only from tika-parser-html-module.
> So is it possible to completely get rid of TagSoup from Tika, as it is EOL? 
> tika-parser-code-module has a dependency on 
> *org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1.*





[jira] [Assigned] (TIKA-4339) Wrong mimetype for font

2024-10-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-4339:
-

Assignee: Tim Allison

> Wrong mimetype for font
> ---
>
> Key: TIKA-4339
> URL: https://issues.apache.org/jira/browse/TIKA-4339
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Gustavo de Oliveira Silva
>    Assignee: Tim Allison
>Priority: Minor
> Attachments: suggestion.diff
>
>
> The current font mimetypes are `application/x-font-otf` and 
> `application/x-font-ttf`.
> They have been this way since 2009, but with RFC8081, IANA added the type 
> `font/*` for fonts.
> IANA: [https://www.iana.org/assignments/media-types/media-types.xhtml#font]
> RFC8081: [https://www.rfc-editor.org/rfc/rfc8081.html]
>  
> The change is almost only `application/x-font-otf` to `font/otf` and 
> `application/x-font-ttf` to `font/ttf`.
> The attached file is a suggestion.
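
The rename proposed in this issue amounts to a small mapping from the legacy names to the RFC 8081 `font/*` names. The sketch below shows only that mapping, assuming callers may still encounter the old names. `FontTypeAliases` and `canonical` are hypothetical, and Tika's actual mime definitions live in its own registry rather than in code like this.

```java
import java.util.Map;

// Hypothetical lookup table; not how Tika stores its mime definitions.
class FontTypeAliases {

    /** Legacy names from the issue mapped to the RFC 8081 names. */
    static final Map<String, String> LEGACY_TO_RFC8081 = Map.of(
            "application/x-font-otf", "font/otf",
            "application/x-font-ttf", "font/ttf");

    /** Returns the RFC 8081 name for a legacy type, or the input unchanged. */
    static String canonical(String mediaType) {
        return LEGACY_TO_RFC8081.getOrDefault(mediaType, mediaType);
    }

    public static void main(String[] args) {
        System.out.println(canonical("application/x-font-ttf")); // prints font/ttf
        System.out.println(canonical("image/png"));              // prints image/png
    }
}
```

Keeping the old names resolvable in some form is one way to soften the breaking-change concern raised in the comments on this issue.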





[jira] [Commented] (TIKA-4339) Wrong mimetype for font

2024-10-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892812#comment-17892812
 ] 

Tim Allison commented on TIKA-4339:
---

Thank you for opening this issue. A few questions...

Do we leave the other fonts as is? There are a bunch of other font types. What 
do we do with those?

I see woff in the spec, but I don't think we handle that. As a separate issue, 
should we add woff and woff2 detection?

Is this a breaking enough change that we should keep it only in the 4.x branch 
and not make the change in the 3.x branch?



> Wrong mimetype for font
> ---
>
> Key: TIKA-4339
> URL: https://issues.apache.org/jira/browse/TIKA-4339
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Gustavo de Oliveira Silva
>Assignee: Tim Allison
>Priority: Minor
> Attachments: suggestion.diff
>
>
> The current font mimetypes are `application/x-font-otf` and 
> `application/x-font-ttf`.
> They have been this way since 2009, but with RFC8081, IANA added the type 
> `font/*` for fonts.
> IANA: [https://www.iana.org/assignments/media-types/media-types.xhtml#font]
> RFC8081: [https://www.rfc-editor.org/rfc/rfc8081.html]
>  
> The change is almost only `application/x-font-otf` to `font/otf` and 
> `application/x-font-ttf` to `font/ttf`.
> The attached file is a suggestion.





[jira] [Assigned] (TIKA-4338) Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module

2024-10-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-4338:
-

Assignee: Tim Allison

> Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module
> --
>
> Key: TIKA-4338
> URL: https://issues.apache.org/jira/browse/TIKA-4338
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sandeep Kulkarni
>    Assignee: Tim Allison
>Priority: Major
>
> As per the release notes for Tika 3.0.0, TagSoup is mentioned as replaced 
> with JSoup. I had requested its removal earlier in TIKA-4109.
> So I integrated Tika 3.0.0 and found that TagSoup is still shown as one of 
> the dependencies of tika-parser-code-module. It seems to have been removed 
> only from tika-parser-html-module.
> So is it possible to completely get rid of TagSoup from Tika, as it is EOL? 
> tika-parser-code-module has a dependency on 
> *org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1.*





[jira] [Comment Edited] (TIKA-4337) Improvements to recent xps mods

2024-10-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892588#comment-17892588
 ] 

Tim Allison edited comment on TIKA-4337 at 10/24/24 6:43 PM:
-

cc [~ruairidh-next] ... no good deed goes unpunished. :D

You've done plenty. I can make these fixes. If you're curious, though, please 
do take a look.


was (Author: talli...@mitre.org):
cc [~ruairidh-next] ... no good deed goes unpunished. :D

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Commented] (TIKA-4337) Improvements to recent xps mods

2024-10-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892588#comment-17892588
 ] 

Tim Allison commented on TIKA-4337:
---

cc [~ruairidh-next] ... no good deed goes unpunished. :D

> Improvements to recent xps mods
> ---
>
> Key: TIKA-4337
> URL: https://issues.apache.org/jira/browse/TIKA-4337
> Project: Tika
>  Issue Type: Task
>    Reporter: Tim Allison
>Priority: Minor
> Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz





[jira] [Created] (TIKA-4337) Improvements to recent xps mods

2024-10-24 Thread Tim Allison (Jira)
Tim Allison created TIKA-4337:
-

 Summary: Improvements to recent xps mods
 Key: TIKA-4337
 URL: https://issues.apache.org/jira/browse/TIKA-4337
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison
 Attachments: xps-reports.tgz

I pulled 249 xps files out of the latest commoncrawl crawl and compared 
3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
number format exceptions where a comma-delimited string is parsed as if it were 
an integer.
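
As a rough illustration of the number format exceptions described above, the 
sketch below parses a comma-delimited value token by token instead of handing 
the whole string to Integer.parseInt. The helper name and the sample value are 
assumptions, not taken from the XPS parser.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the helper name and sample input are assumptions,
// not code from Tika's XPS parser.
class CommaDelimitedInts {
    // Splits on commas and parses each non-empty token separately, so a value
    // such as "12,34" no longer triggers a NumberFormatException.
    static List<Integer> parse(String raw) {
        List<Integer> values = new ArrayList<>();
        for (String token : raw.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                values.add(Integer.parseInt(trimmed));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("12, 34,56"));
    }
}
```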

Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
content_diffs_no_exceptions.xlsx.

The source files are available here: 
https://corpora.tika.apache.org/base/share/xps.tgz







[jira] [Resolved] (TIKA-4336) 'application/json' is not instance of 'text/plain'

2024-10-24 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4336.
---
Fix Version/s: 3.0.1
   4.0.0
   Resolution: Fixed

Thank you [~valfirst]!

> 'application/json' is not instance of 'text/plain'
> --
>
> Key: TIKA-4336
> URL: https://issues.apache.org/jira/browse/TIKA-4336
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Valery Yatsynovich
>Priority: Major
> Fix For: 3.0.1, 4.0.0
>
>
> {{MediaTypeRegistry.getDefaultRegistry().isInstanceOf("application/json", 
> MediaType.TEXT_PLAIN)}}
> => true on version {{2.9.2}}
> => false on version {{3.0.0}}





[jira] [Resolved] (TIKA-4315) XPS file parser does not emit whitespace as expected

2024-10-24 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4315.
---
Fix Version/s: 2.9.3
   3.0.1
   4.0.0
   Resolution: Fixed

Thank you [~ruairidh-next]!

> XPS file parser does not emit whitespace as expected
> 
>
> Key: TIKA-4315
> URL: https://issues.apache.org/jira/browse/TIKA-4315
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1, 2.9.2
>Reporter: Ruairidh Williamson
>Priority: Major
> Fix For: 2.9.3, 3.0.1, 4.0.0
>
> Attachments: testXLSX.xps
>
>
> We are using Tika to extract text from XPS files and have hit an issue where 
> whitespace is not emitted where we would expect. See the attached example 
> file: when opened, it visibly has a large gap between "x" and "abcde1234f", 
> but when Tika extracts it, it calls `characters` with "x" and then 
> `characters` with "abcde1234f". We would expect an `ignorableWhitespace` call 
> between those two calls, but we don't get one.
> I have a pull request that fixes the issue, which I will submit.
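
To make the reported and expected call sequences concrete, here is a small 
sketch using the JDK's SAX DefaultHandler. It only simulates the calls a parser 
would make; the emit helper and the single-space separator are assumptions, and 
none of this is the XPS parser's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Records SAX text events so the reported and expected sequences can be compared.
class RecordingHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters:" + new String(ch, start, length));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        events.add("ignorableWhitespace");
    }

    // Simulates a parser emitting two text runs, optionally separated by whitespace.
    static List<String> emit(boolean withWhitespace) {
        RecordingHandler handler = new RecordingHandler();
        char[] first = "x".toCharArray();
        char[] second = "abcde1234f".toCharArray();
        char[] space = " ".toCharArray();
        handler.characters(first, 0, first.length);
        if (withWhitespace) {
            handler.ignorableWhitespace(space, 0, space.length);
        }
        handler.characters(second, 0, second.length);
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(emit(false)); // sequence reported in the issue
        System.out.println(emit(true));  // sequence the reporter expected
    }
}
```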





[jira] [Commented] (TIKA-4336) 'application/json' is not instance of 'text/plain'

2024-10-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892119#comment-17892119
 ] 

Tim Allison commented on TIKA-4336:
---

TIKA-4119 changed {{application/javascript}} -> {{text/javascript}}. json is a 
subclass of {{application/javascript}} so the lookup is now missing the path to 
{{text/plain}}. We can modify the subclass of json to {{text/javascript}} and 
we should be good to go.
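
A toy model of the lookup may help show why the path went missing. This is not 
Tika's MediaTypeRegistry; the class below is a hypothetical stand-in, and it 
assumes that {{text/javascript}} has {{text/plain}} as its supertype.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a supertype registry; it is not Tika's MediaTypeRegistry.
class ToyTypeRegistry {
    private final Map<String, String> superTypes = new HashMap<>();

    void setSuperType(String type, String superType) {
        superTypes.put(type, superType);
    }

    // Walks the supertype chain from 'type' and reports whether it reaches 'ancestor'.
    boolean isInstanceOf(String type, String ancestor) {
        for (String t = type; t != null; t = superTypes.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyTypeRegistry registry = new ToyTypeRegistry();
        // Assumption: text/javascript is registered under text/plain.
        registry.setSuperType("text/javascript", "text/plain");
        // Stale parent: nothing links application/javascript to text/plain.
        registry.setSuperType("application/json", "application/javascript");
        System.out.println(registry.isInstanceOf("application/json", "text/plain"));
        // Proposed fix: reparent json under text/javascript.
        registry.setSuperType("application/json", "text/javascript");
        System.out.println(registry.isInstanceOf("application/json", "text/plain"));
    }
}
```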

If you have time, a PR with a new unit test based on your code above would help 
move this more quickly.

Thank you!

> 'application/json' is not instance of 'text/plain'
> --
>
> Key: TIKA-4336
> URL: https://issues.apache.org/jira/browse/TIKA-4336
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Valery Yatsynovich
>Priority: Major
>
> {{MediaTypeRegistry.getDefaultRegistry().isInstanceOf("application/json", 
> MediaType.TEXT_PLAIN)}}
> => true on version {{2.9.2}}
> => false on version {{3.0.0}}





[jira] [Resolved] (TIKA-4330) Add a MetadataListFilter

2024-10-22 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4330.
---
Fix Version/s: 4.0.0
   3.0.1
   Resolution: Fixed

> Add a MetadataListFilter
> 
>
> Key: TIKA-4330
> URL: https://issues.apache.org/jira/browse/TIKA-4330
> Project: Tika
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 4.0.0, 3.0.1
>
>
> We currently have MetadataFilters that operate on a single metadata instance. 
> There are some use cases where a filter on one metadata instance in the list 
> needs access to other information in the other metadata objects in the list.
> The simplest use case for this would be to populate an "attachment_count" 
> metadata field in the parent's metadata object.





[jira] [Created] (TIKA-4335) Refactor tika-server to avoid a shaded/fat jar in 4.x

2024-10-22 Thread Tim Allison (Jira)
Tim Allison created TIKA-4335:
-

 Summary: Refactor tika-server to avoid a shaded/fat jar in 4.x
 Key: TIKA-4335
 URL: https://issues.apache.org/jira/browse/TIKA-4335
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[jira] [Created] (TIKA-4333) Remove tika-batch from 4.x/main

2024-10-22 Thread Tim Allison (Jira)
Tim Allison created TIKA-4333:
-

 Summary: Remove tika-batch from 4.x/main
 Key: TIKA-4333
 URL: https://issues.apache.org/jira/browse/TIKA-4333
 Project: Tika
  Issue Type: Task
  Components: tika-batch
Reporter: Tim Allison


Move batch processing in tika-app and tika-eval to tika-pipes.





[jira] [Created] (TIKA-4334) Move tika pipes components in tika-core to tika-pipes-core in 4.x

2024-10-22 Thread Tim Allison (Jira)
Tim Allison created TIKA-4334:
-

 Summary: Move tika pipes components in tika-core to 
tika-pipes-core in 4.x
 Key: TIKA-4334
 URL: https://issues.apache.org/jira/browse/TIKA-4334
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[jira] [Created] (TIKA-4332) Consider removing dotnet module in 4.x/main

2024-10-22 Thread Tim Allison (Jira)
Tim Allison created TIKA-4332:
-

 Summary: Consider removing dotnet module in 4.x/main
 Key: TIKA-4332
 URL: https://issues.apache.org/jira/browse/TIKA-4332
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


The dotnet module hasn't been updated since 1.11. Unless there are objections, 
let's remove it in 4.x.





[jira] [Created] (TIKA-4331) Bump tika-docker to ubuntu:oracular?

2024-10-21 Thread Tim Allison (Jira)
Tim Allison created TIKA-4331:
-

 Summary: Bump tika-docker to ubuntu:oracular?
 Key: TIKA-4331
 URL: https://issues.apache.org/jira/browse/TIKA-4331
 Project: Tika
  Issue Type: Task
  Components: tika-docker
Reporter: Tim Allison


Should we bump the base image to oracular? I don't know enough about the diffs 
from noble.





[jira] [Created] (TIKA-4330) Add a MetadataListFilter

2024-10-21 Thread Tim Allison (Jira)
Tim Allison created TIKA-4330:
-

 Summary: Add a MetadataListFilter
 Key: TIKA-4330
 URL: https://issues.apache.org/jira/browse/TIKA-4330
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


We currently have MetadataFilters that operate on a single metadata instance. 
There are some use cases where a filter on one metadata instance in the list 
needs access to other information in the other metadata objects in the list.

The simplest use case for this would be to populate an "attachment_count" 
metadata field in the parent's metadata object.
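
A minimal sketch of the "attachment_count" use case, under stated assumptions: 
a plain Map stands in for Tika's Metadata class, the first list entry is taken 
to be the parent document, and the class name is hypothetical rather than the 
interface that was eventually added.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: Map<String, String> stands in for Tika's Metadata class, and
// the class name is hypothetical rather than the API added in TIKA-4330.
class AttachmentCountListFilter {
    // Assumes index 0 is the parent document and the rest are attachments.
    static List<Map<String, String>> filter(List<Map<String, String>> metadataList) {
        if (!metadataList.isEmpty()) {
            int attachments = metadataList.size() - 1;
            metadataList.get(0).put("attachment_count", Integer.toString(attachments));
        }
        return metadataList;
    }

    public static void main(String[] args) {
        List<Map<String, String>> list = new ArrayList<>();
        list.add(new HashMap<>()); // parent
        list.add(new HashMap<>()); // attachment 1
        list.add(new HashMap<>()); // attachment 2
        System.out.println(filter(list).get(0));
    }
}
```

The point of the list-level signature is that the count cannot be computed by 
a filter that sees only one metadata object at a time.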







[jira] [Resolved] (TIKA-4329) Release tika-3.0.0's docker image

2024-10-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4329.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Release tika-3.0.0's docker image
> -
>
> Key: TIKA-4329
> URL: https://issues.apache.org/jira/browse/TIKA-4329
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 3.0.0
>
>
> Also bump jre to 21?





[jira] [Created] (TIKA-4329) Release tika-3.0.0's docker image

2024-10-21 Thread Tim Allison (Jira)
Tim Allison created TIKA-4329:
-

 Summary: Release tika-3.0.0's docker image
 Key: TIKA-4329
 URL: https://issues.apache.org/jira/browse/TIKA-4329
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Also bump jre to 21?





[jira] [Comment Edited] (TIKA-4326) General updates for 3.0.1

2024-10-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891536#comment-17891536
 ] 

Tim Allison edited comment on TIKA-4326 at 10/21/24 1:10 PM:
-

Thank you [~tilman]. I just added jdk23 github action builds to branch_3x and 
main.


was (Author: talli...@mitre.org):
Thank you [~tilman]. I just added jdk23 builds to branch_3x and main.

> General updates for 3.0.1
> -
>
> Key: TIKA-4326
> URL: https://issues.apache.org/jira/browse/TIKA-4326
> Project: Tika
>  Issue Type: Task
>Reporter: Tilman Hausherr
>Priority: Major
>






[jira] [Commented] (TIKA-4326) General updates for 3.0.1

2024-10-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891536#comment-17891536
 ] 

Tim Allison commented on TIKA-4326:
---

Thank you [~tilman]. I just added jdk23 builds to branch_3x and main.

> General updates for 3.0.1
> -
>
> Key: TIKA-4326
> URL: https://issues.apache.org/jira/browse/TIKA-4326
> Project: Tika
>  Issue Type: Task
>Reporter: Tilman Hausherr
>Priority: Major
>






[jira] [Created] (TIKA-4328) Update or remove tika-deployment snaps

2024-10-21 Thread Tim Allison (Jira)
Tim Allison created TIKA-4328:
-

 Summary: Update or remove tika-deployment snaps
 Key: TIKA-4328
 URL: https://issues.apache.org/jira/browse/TIKA-4328
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


These haven't been updated since 2.0.0-SNAPSHOT, apparently. I doubt anyone is 
using them. We should either update them or remove them.





[jira] [Commented] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17

2024-10-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891194#comment-17891194
 ] 

Tim Allison commented on TIKA-4322:
---

I just updated Jenkins a bit. I removed all the {{branch_1x}}, and I deleted 
{{-jdk11}} from main. I added {{tika-branch_3x-*}} jdks. There will likely be 
surprises, but this should be a decent start.

> Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
> --
>
> Key: TIKA-4322
> URL: https://issues.apache.org/jira/browse/TIKA-4322
> Project: Tika
>  Issue Type: Task
>    Reporter: Tim Allison
>Priority: Major
>






[jira] [Comment Edited] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17

2024-10-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891194#comment-17891194
 ] 

Tim Allison edited comment on TIKA-4322 at 10/19/24 8:17 PM:
-

I just updated Jenkins a bit. I removed all the {{branch_1x}}, and I deleted 
{{-jdk11}} from main. 

I left in the disabled {{tika-branch_2x-jdk8}} as a reminder that that doesn't 
work. We can delete it in 6 months...if all goes well. :D

I added {{tika-branch_3x-*}} jdks. There will likely be surprises, but this 
should be a decent start.


was (Author: talli...@mitre.org):
I just updated Jenkins a bit. I removed all the {{branch_1x}}, and I deleted 
{{-jdk11}} from main. I added {{tika-branch_3x-*}} jdks. There will likely be 
surprises, but this should be a decent start.

> Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
> --
>
> Key: TIKA-4322
> URL: https://issues.apache.org/jira/browse/TIKA-4322
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Updated] (TIKA-4325) Consider removing some unsupported modules in 4.x

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4325:
--
Summary: Consider removing some unsupported modules in 4.x  (was: Consider 
removing some unsupported modules)

> Consider removing some unsupported modules in 4.x
> -
>
> Key: TIKA-4325
> URL: https://issues.apache.org/jira/browse/TIKA-4325
> Project: Tika
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> I propose removing tika-age-recogniser and tika-dl in 4.x. They'll still be 
> available in 3.x for at least a year. Any objections?
> Are there other modules that we'd like to remove?





[jira] [Updated] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4247:
--
Fix Version/s: 3.0.1
   (was: 3.0.0)

> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.1
>
>
> add ability to send request headers





[jira] [Resolved] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4247.
---
Resolution: Fixed

I think we fixed this by including the ParseContext in the FetchEmitTuple? 
Apologies if I've confused issues. Please re-open if I'm wrong.

> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> add ability to send request headers





[ANNOUNCE] Apache Tika 3.0.0 released

2024-10-19 Thread Tim Allison
The Apache Tika project is pleased to announce the release of Apache
Tika 3.0.0. The release contents have been pushed out to the main
Apache release site and to the Maven Central sync.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries.

Apache Tika 3.0.0 includes numerous bug fixes and dependency upgrades.
The biggest change in the 3.x branch is that it requires >= Java 11.
Details can be found in the changes file:
https://www.apache.org/dist/tika/3.0.0/CHANGES-3.0.0.txt

Apache Tika is available on the download page:
https://tika.apache.org/download.html

Apache Tika will be available shortly in binary form or for use with
Maven 2 from the Central Repository:
https://repo1.maven.org/maven2/org/apache/tika/

When downloading, please remember to verify the downloads using
signatures found: https://www.apache.org/dist/tika/KEYS

For more information on Apache Tika, visit the project home page:
https://tika.apache.org/

NOTE: This release requires Java 11. We plan to support the
2.x branch (which requires Java 8) for six months after the
release of 3.0.0. See:
https://cwiki.apache.org/confluence/display/TIKA/Tika+Roadmap+--+2.x%2C+3.x+and+Beyond

-- Tim Allison, on behalf of the Apache Tika community


[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1907:
--
Fix Version/s: 3.0.1
   (was: 3.0.0)

> Big Pdf parsing to text - Out of memory
> ---
>
> Key: TIKA-1907
> URL: https://issues.apache.org/jira/browse/TIKA-1907
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Nicolas Daniels
>Priority: Major
> Fix For: 3.0.1
>
>
> Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
> I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe 
> PDFBox is not the appropriate lib to use in such case.
> Trying to read the same PDF using Tika leads to the same problem:
> {code:title=Test.java|borderStyle=solid}
> // Needs java.io.*, org.junit.Test, org.apache.tika.metadata.Metadata,
> // org.apache.tika.parser.ParseContext, org.apache.tika.parser.pdf.PDFParser
> // and org.apache.tika.sax.BodyContentHandler.
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // Write extracted text straight to a file; both resources are closed
>     // automatically, even if parsing fails.
>     try (InputStream inputStream =
>                  new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>          FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
>         BodyContentHandler handler = new BodyContentHandler(fileWriter);
>         Metadata metadata = new Metadata();
>         new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
>     }
> }
> {code}





[jira] [Created] (TIKA-4323) Consider removing 1.x info from site?

2024-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-4323:
-

 Summary: Consider removing 1.x info from site?
 Key: TIKA-4323
 URL: https://issues.apache.org/jira/browse/TIKA-4323
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Do we need to keep the javadocs etc for the 1.x branch which has been EOL'd 
since September 2022?

Unless there are objections, I'll slim down our site to include only 2.x and 
above.

I certainly understand if we need to keep them around for archival purposes. 
Let me know.





[jira] [Updated] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4247:
--
Fix Version/s: 3.0.0
   (was: 3.0.1)

> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> add ability to send request headers





[jira] [Commented] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name

2024-10-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891179#comment-17891179
 ] 

Tim Allison commented on TIKA-4298:
---

This was resolved in 3.0.0 and should be closed?

> Failed to detect charset for zip entry with short non-Unicode file name
> ---
>
> Key: TIKA-4298
> URL: https://issues.apache.org/jira/browse/TIKA-4298
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Mingchun Zhao
>Priority: Major
> Fix For: 2.9.3, 3.0.1
>
> Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file  
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file 
> name is Shift_JIS, but the detect() method within the PackageParser class was 
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
>  xmlns="http://www.w3.org/1999/xhtml";>
> 
> 
>  content="org.apache.tika.parser.pkg.PackageParser"/>
> 
> 
> 
> 
> 
> 
> 
> 
> shiba.png
> 
> 
> ���1.txt
> あいうえお
> かきくけこ
> 
> 
> ���2.txt
> さしすせそ
> たちつてと
> 
> % {code}
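
The garbling shown in the report can be reproduced in a few lines of plain 
Java. This sketch only demonstrates the symptom (Shift_JIS bytes decoded with 
the wrong charset); it does not reproduce PackageParser's detection logic, and 
the class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Demonstrates the symptom only; this is not PackageParser's detection logic.
class ZipEntryNameDecoding {
    // Decodes the raw bytes of a zip entry name with the given charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        // First text entry name from the report, written with Unicode escapes.
        String name = "\u6587\u7ae01.txt";
        byte[] raw = name.getBytes(shiftJis);
        System.out.println(decode(raw, shiftJis));                // round-trips
        System.out.println(decode(raw, StandardCharsets.UTF_8));  // replacement characters
    }
}
```

Short names give a statistical detector very few bytes to work with, which is 
presumably why this case was missed.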





[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4318:
--
Fix Version/s: 3.0.1
   (was: 3.0.0)

> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
>  Issue Type: Task
>        Reporter: Tim Allison
>    Assignee: Tim Allison
>Priority: Minor
> Fix For: 3.0.1
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Created] (TIKA-4325) Consider removing some unsupported modules

2024-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-4325:
-

 Summary: Consider removing some unsupported modules
 Key: TIKA-4325
 URL: https://issues.apache.org/jira/browse/TIKA-4325
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I propose removing tika-age-recogniser and tika-dl in 4.x. They'll still be 
available in 3.x for at least a year. Any objections?

Are there other modules that we'd like to remove?





[jira] [Created] (TIKA-4324) Update dependencies in main that require Java 17

2024-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-4324:
-

 Summary: Update dependencies in main that require Java 17
 Key: TIKA-4324
 URL: https://issues.apache.org/jira/browse/TIKA-4324
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


There are a handful of dependencies whose most recent versions require java 17. 
We can now make those updates in {{main}}





[jira] [Updated] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name

2024-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4298:
--
Fix Version/s: 3.0.1
   (was: 3.0.0)

> Failed to detect charset for zip entry with short non-Unicode file name
> ---
>
> Key: TIKA-4298
> URL: https://issues.apache.org/jira/browse/TIKA-4298
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Mingchun Zhao
>Priority: Major
> Fix For: 2.9.3, 3.0.1
>
> Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file  
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file 
> name is Shift_JIS, but the detect() method within the PackageParser class was 
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
>  xmlns="http://www.w3.org/1999/xhtml";>
> 
> 
>  content="org.apache.tika.parser.pkg.PackageParser"/>
> 
> 
> 
> 
> 
> 
> 
> 
> shiba.png
> 
> 
> ���1.txt
> あいうえお
> かきくけこ
> 
> 
> ���2.txt
> さしすせそ
> たちつてと
> 
> % {code}





[jira] [Created] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17

2024-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-4322:
-

 Summary: Create branch_3x and update main to 4.0.0-SNAPSHOT and 
Java 17
 Key: TIKA-4322
 URL: https://issues.apache.org/jira/browse/TIKA-4322
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[RESULT] [VOTE] Release Apache Tika 3.0.0 Candidate #1

2024-10-19 Thread Tim Allison
The vote has passed with 4 binding +1s and no -1s.

+1s

Nicholas DiPiazza
Oleg Tikhonov
Tilman Hausherr
Tim Allison

I'll update the website and release the artifacts in the next few days.

Thank you, all!

Best,

 Tim

On Wed, Oct 16, 2024 at 7:24 AM Tim Allison  wrote:
>
> A candidate for the Tika 3.0.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/3.0.0
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/3.0.0-rc1/
>
> The SHA-512 checksum of the archive is
> c5eb92bc895d96492b2d2577d14df6187e46ab7c8a9f64aaf19d4f140f07caf1223d073c2cbb47b5519bb952eee50f39563004b8ad49906f45dffc9b6df74350.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1107/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 3.0.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 3.0.0
> [ ] -1 Do not release this package because...
>
>
> Here's my +1.
>
> Thank you, all!
>
> Best,
>
>  Tim
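
For anyone checking the SHA-512 value quoted in the vote email, a minimal Java 
sketch of computing and comparing a digest follows. The class and method names 
are hypothetical, and the example hashes an in-memory byte array; for a real 
release archive you would read the file's bytes (or stream them) first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for comparing a computed SHA-512 digest with a published one.
class Sha512Check {
    // Returns the lowercase hex SHA-512 digest of the given bytes.
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 is not available in this JVM", e);
        }
    }

    // Compares against a published checksum, ignoring case and surrounding whitespace.
    static boolean matches(byte[] data, String published) {
        return sha512Hex(data).equalsIgnoreCase(published.trim());
    }

    public static void main(String[] args) {
        byte[] sample = "example".getBytes(StandardCharsets.UTF_8);
        String computed = sha512Hex(sample);
        System.out.println(computed);
        System.out.println(matches(sample, computed.toUpperCase() + "\n"));
    }
}
```

Note that a checksum only guards against corruption; the signatures and KEYS 
file are what establish who produced the release.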


[jira] [Updated] (TIKA-4309) ExecutableParser: support MachO

2024-10-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4309:
--
Fix Version/s: 3.0.1

> ExecutableParser: support MachO
> ---
>
> Key: TIKA-4309
> URL: https://issues.apache.org/jira/browse/TIKA-4309
> Project: Tika
>  Issue Type: New Feature
>Reporter: Alexey Pelykh
>Priority: Major
> Fix For: 3.0.1
>
>






[jira] [Created] (TIKA-4321) Clean up Solr integration tests

2024-10-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4321:
-

 Summary: Clean up Solr integration tests
 Key: TIKA-4321
 URL: https://issues.apache.org/jira/browse/TIKA-4321
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Solr currently supports 9.x and 8.x. There's a vote on the Lucene side to stop 
support for Lucene 8.x soon. I think we can get rid of our unit tests for 6.x 
and 7.x.





[jira] [Commented] (TIKA-4319) Wrong exit code upon successful start of Tika server

2024-10-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890181#comment-17890181
 ] 

Tim Allison commented on TIKA-4319:
---

Thank you for opening this. I'm not that familiar with the service scripts. If 
you can recommend a PR, that'd help.

Maybe [~epugh] might have some time to review?

> Wrong exit code upon successful start of Tika server
> 
>
> Key: TIKA-4319
> URL: https://issues.apache.org/jira/browse/TIKA-4319
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.9.2
> Environment: Tested on a Debian 12 VM with the following kernel:
> 6.1.0-11-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08) 
> x86_64 GNU/Linux
>Reporter: Corrado Fiore
>Priority: Trivial
>  Labels: easyfix
>
> I was trying to create a systemd unit file for Tika and I noticed that the 
> server will return an exit code of `1` instead of `0`.  This makes it 
> confusing b/c systemd will report that the service crashed (whereas it 
> started correctly).
> h3. Steps to reproduce the problem
> {{{color:#00875a}user@test-instance{color}:~$ sudo su -c 
> "TIKA_INCLUDE=\"/etc/default/tika.in.sh\" /opt/tika/bin/tika start" - tika}}
> {{Default server /opt/tika/}}
> {{Waiting up to 180 seconds to see Tika running on port 9998 [-]  }}
> {{Started Tika server on port 9998 (pid=50039}}
> {{50001). Happy extracting!}}
> {{{color:#00875a}user@test-instance{color}:~$ echo $?}}
> {{1}}
> h3. Expected behaviour
> A command that executes successfully should exit with an exit code 0.





[jira] [Resolved] (TIKA-4320) Modernize opensearch integration tests

2024-10-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4320.
---
Fix Version/s: 3.0.1
Resolution: Fixed

> Modernize opensearch integration tests
> --
>
> Key: TIKA-4320
> URL: https://issues.apache.org/jira/browse/TIKA-4320
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 3.0.1
>
>
> We should remove the Elasticsearch 7.x unit tests and the OpenSearch 1.x 
> tests. And, we can now use OpenSearch's {{testcontainers}} module.





[jira] [Created] (TIKA-4320) Modernize opensearch integration tests

2024-10-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4320:
-

 Summary: Modernize opensearch integration tests
 Key: TIKA-4320
 URL: https://issues.apache.org/jira/browse/TIKA-4320
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


We should remove the Elasticsearch 7.x unit tests and the OpenSearch 1.x tests. 
And, we can now use OpenSearch's {{testcontainers}} module.





[jira] [Commented] (TIKA-4170) Tika to extract Apple Key files

2024-10-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890103#comment-17890103
 ] 

Tim Allison commented on TIKA-4170:
---

Unrelated, but I just noticed that LibreOffice can open the example keynote 
files, which means that we _could_ write a bridge to LibreOffice or OpenOffice 
to extract text from these?!

> Tika to extract Apple Key files
> ---
>
> Key: TIKA-4170
> URL: https://issues.apache.org/jira/browse/TIKA-4170
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Major
> Attachments: Apple_key_file.zip, keynotecreated-2.9.3-SNAPSHOT.zip, 
> keynotecreated.zip
>
>
> We are trying to use Tika to extract Apple Key files. The test data is attached.
> Could you please check why Tika has been unable to extract Apple Key files 
> since Tika 2.9.0?
> The test results below are for your reference. Thank you.
>
> Tika version --> Child documents after extraction?
> 2.4.1 --> YES
> 2.6.0 --> YES
> 2.7.0 --> YES
> 2.8.0 --> YES
> 2.9.0 --> NO
> 2.9.1 --> NO





[jira] [Resolved] (TIKA-4280) Tasks for the 3.0.0 release

2024-10-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4280.
---
Fix Version/s: 3.0.0
Resolution: Fixed

3.0.0 rc1 is under vote now.

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: 2PSMEFJEYU7EPAZXQQDD6OL2WOQLBJRY.zip
>
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Commented] (TIKA-4170) Tika to extract Apple Key files

2024-10-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890096#comment-17890096
 ] 

Tim Allison commented on TIKA-4170:
---

I see similar behavior in 2.9.3-SNAPSHOT.

> Tika to extract Apple Key files
> ---
>
> Key: TIKA-4170
> URL: https://issues.apache.org/jira/browse/TIKA-4170
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Major
> Attachments: Apple_key_file.zip, keynotecreated-2.9.3-SNAPSHOT.zip, 
> keynotecreated.zip
>
>
> We are trying to use Tika to extract Apple Key files. The test data is attached.
> Could you please check why Tika has been unable to extract Apple Key files 
> since Tika 2.9.0?
> The test results below are for your reference. Thank you.
>
> Tika version --> Child documents after extraction?
> 2.4.1 --> YES
> 2.6.0 --> YES
> 2.7.0 --> YES
> 2.8.0 --> YES
> 2.9.0 --> NO
> 2.9.1 --> NO





[jira] [Updated] (TIKA-4170) Tika to extract Apple Key files

2024-10-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4170:
--
Attachment: keynotecreated-2.9.3-SNAPSHOT.zip

> Tika to extract Apple Key files
> ---
>
> Key: TIKA-4170
> URL: https://issues.apache.org/jira/browse/TIKA-4170
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Major
> Attachments: Apple_key_file.zip, keynotecreated-2.9.3-SNAPSHOT.zip, 
> keynotecreated.zip
>
>
> We are trying to use Tika to extract Apple Key files. The test data is attached.
> Could you please check why Tika has been unable to extract Apple Key files 
> since Tika 2.9.0?
> The test results below are for your reference. Thank you.
>
> Tika version --> Child documents after extraction?
> 2.4.1 --> YES
> 2.6.0 --> YES
> 2.7.0 --> YES
> 2.8.0 --> YES
> 2.9.0 --> NO
> 2.9.1 --> NO





[jira] [Commented] (TIKA-4170) Tika to extract Apple Key files

2024-10-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890091#comment-17890091
 ] 

Tim Allison commented on TIKA-4170:
---

I'm attaching the output for Tika 3.x's tika-app {{java -jar 
tika-app-3.0.1-SNAPSHOT.jar -J -t keynotecreated.key > keynotecreated.json}}.

With tesseract installed, Tika is extracting the attachments (well, images) and 
running tesseract on them.

How are you calling Tika? Which attachments are not processed in which files?
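
For anyone comparing versions, a rough way to count child documents in the {{-J}} output is sketched below in Python. It assumes the JSON is a list with one metadata object per document (container first) and that embedded documents carry an `X-TIKA:embedded_resource_path` key; both are assumptions to check against your own output, and the sample data here is made up.

```python
import json

def count_children(metadata_list):
    """Count entries that look like embedded (child) documents."""
    return sum(1 for md in metadata_list if "X-TIKA:embedded_resource_path" in md)

# Hypothetical two-entry sample standing in for keynotecreated.json.
sample = json.loads(
    '[{"resourceName": "keynotecreated.key"},'
    ' {"X-TIKA:embedded_resource_path": "/image1.png"}]'
)
print(count_children(sample))  # 1
```

Running the same count on the output from 2.8.0 and 2.9.0 would show whether the children really disappear between those versions.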

> Tika to extract Apple Key files
> ---
>
> Key: TIKA-4170
> URL: https://issues.apache.org/jira/browse/TIKA-4170
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Major
> Attachments: Apple_key_file.zip, keynotecreated.zip
>
>
> We are trying to use Tika to extract Apple Key files. The test data is attached.
> Could you please check why Tika has been unable to extract Apple Key files 
> since Tika 2.9.0?
> The test results below are for your reference. Thank you.
>
> Tika version --> Child documents after extraction?
> 2.4.1 --> YES
> 2.6.0 --> YES
> 2.7.0 --> YES
> 2.8.0 --> YES
> 2.9.0 --> NO
> 2.9.1 --> NO





[jira] [Updated] (TIKA-4170) Tika to extract Apple Key files

2024-10-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4170:
--
Attachment: keynotecreated.zip

> Tika to extract Apple Key files
> ---
>
> Key: TIKA-4170
> URL: https://issues.apache.org/jira/browse/TIKA-4170
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Major
> Attachments: Apple_key_file.zip, keynotecreated.zip
>
>
> We are trying to use Tika to extract Apple Key files. The test data is attached.
> Could you please check why Tika has been unable to extract Apple Key files 
> since Tika 2.9.0?
> The test results below are for your reference. Thank you.
>
> Tika version --> Child documents after extraction?
> 2.4.1 --> YES
> 2.6.0 --> YES
> 2.7.0 --> YES
> 2.8.0 --> YES
> 2.9.0 --> NO
> 2.9.1 --> NO





[jira] [Updated] (TIKA-4316) Goals for Tika 4.x

2024-10-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4316:
--
Description: 
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project? Consider using 
pf4j in tika-pipes and other components?
4) Remove unsupported dl4j and sentiment analysis and agepredictor modules 
and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin or pf4j instead of the shade plugin
6) Use an auto-correcting linter instead of checkstyle (cosium with google's 
style format?)
7) Remove the legacy external parser mechanism in favor of the external2 
mechanism

  was:
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project?
4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin instead of the shade plugin
6) Use an auto-correcting linter instead of checkstyle
7) Remove the legacy external parser mechanism in favor of the external2 
mechanism


> Goals for Tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here: 
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project? Consider 
> using pf4j in tika-pipes and other components?
> 4) Remove unsupported dl4j and sentiment analysis and agepredictor modules 
> and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin or pf4j instead of the shade plugin
> 6) Use an auto-correcting linter instead of checkstyle (cosium with google's 
> style format?)
> 7) Remove the legacy external parser mechanism in favor of the external2 
> mechanism





Tika Roadmap 2.x, 3.x and beyond

2024-10-16 Thread Tim Allison
All,
  I created a wiki page for the proposed roadmap:
https://cwiki.apache.org/confluence/display/TIKA/Tika+Roadmap+--+2.x%2C+3.x+and+Beyond
  If there are any objections or alternate proposals to my initial
proposal, please discuss those objections on the user/dev lists.
  Thank you, all!

   Cheers,

Tim


[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890037#comment-17890037
 ] 

Tim Allison commented on TIKA-4318:
---

My initial memory on this issue was that the autogenerated classes in grpc were 
causing problems. I then stubbed my toes for a while trying to get everything 
working and couldn't find any problems with grpc... until I went to make the 
release. During the final step in {{release:perform}}, javadoc errors out on 
grpc with the autogenerated classes.

I turned the "ignore javadocs" setting for grpc back on in 3.x.

Once we make the release, we can maybe move the autogenerated classes to their 
own package in 4.x and then configure javadoc to ignore that package. 

If anyone has any better ideas, please share. This was not a great way to spend 
time.



> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. 
> Let's see if we can get this to work for the 3.0.0 release?





Re: 3.0.0 release?

2024-10-16 Thread Tim Allison
Yes, I completely agree about the surprises. I don't want to dump more
work on you. :D

On Tue, Oct 15, 2024 at 8:51 AM Tilman Hausherr  wrote:
>
> On 08.10.2024 16:05, Tim Allison wrote:
> > I realize that even dependency maintenance on three concurrent branches
> > will be burdensome. Perhaps we fall back to "update dependencies before a
> > release and before the regression tests" at least on the 2.x and 3.x
> > branches?
>
> It wasn't burdensome for me, I've usually done this while watching TV.
> Doing it only once before the release could mean nasty surprises at the
> wrong moment.
>
> Tilman
>
>
>


[VOTE] Release Apache Tika 3.0.0 Candidate #1

2024-10-16 Thread Tim Allison
A candidate for the Tika 3.0.0 release is available at:
https://dist.apache.org/repos/dist/dev/tika/3.0.0

The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/3.0.0-rc1/

The SHA-512 checksum of the archive is
c5eb92bc895d96492b2d2577d14df6187e46ab7c8a9f64aaf19d4f140f07caf1223d073c2cbb47b5519bb952eee50f39563004b8ad49906f45dffc9b6df74350.
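
For voters who want to check the archive locally, here is a minimal Python sketch of a SHA-512 comparison. The helper names and the local file name in the comment are assumptions, not part of the release tooling; the demo runs on in-memory bytes so it is self-contained.

```python
import hashlib
import io

def sha512_hex(stream, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a binary stream, read in chunks."""
    digest = hashlib.sha512()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def checksum_matches(stream, expected_hex):
    """Compare digests, ignoring case, surrounding whitespace and a trailing period."""
    return sha512_hex(stream) == expected_hex.strip().rstrip(".").lower()

# Demo on in-memory bytes. For a real check, pass
# open("tika-3.0.0-src.zip", "rb") (hypothetical local file name)
# and the checksum pasted from this email.
demo = io.BytesIO(b"example archive bytes")
expected = hashlib.sha512(b"example archive bytes").hexdigest()
print(checksum_matches(demo, expected))  # True
```

Stripping the trailing period is deliberate: the checksum above ends a sentence, so a copy and paste from the email is likely to include it.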

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1107/org/apache/tika

Please vote on releasing this package as Apache Tika 3.0.0.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 3.0.0
[ ] -1 Do not release this package because...


Here's my +1.

Thank you, all!

Best,

 Tim


[jira] [Updated] (TIKA-4316) Goals for Tika 4.x

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4316:
--
Description: 
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project?
4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin instead of the shade plugin
6) Use an auto-correcting linter instead of checkstyle
7) Remove the legacy external parser mechanism in favor of the external2 
mechanism

  was:
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project?
4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin instead of the shade plugin
6) Use an auto-correcting linter instead of checkstyle


> Goals for Tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here: 
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project?
> 4) Remove unsupported dl4j and sentiment analysis modules and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin instead of the shade plugin
> 6) Use an auto-correcting linter instead of checkstyle
> 7) Remove the legacy external parser mechanism in favor of the external2 
> mechanism





[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4318:
--
Description: 
When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
aggregate}}. 

Let's see if we can get this to work for the 3.0.0 release?


  was:
When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
aggregate}}. My memory is that there was a problem with the autogenerated code 
in grpc. 

Let's see if we can get this to work for the 3.0.0 release?



> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889690#comment-17889690
 ] 

Tim Allison commented on TIKA-4318:
---

Unless there's a better solution, I'll add back {{sourcepath}} into the repo, 
but I'll manually comment it out when I generate the javadocs for the release.

There has to be a better option. :/

> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. My memory is that there was a problem with the autogenerated 
> code in grpc. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Comment Edited] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889684#comment-17889684
 ] 

Tim Allison edited comment on TIKA-4318 at 10/15/24 2:04 PM:
-

Removing {{src/main/java}} causes ci/cd to fail.

{noformat}
Failed to execute goal 
org.apache.maven.plugins:maven-javadoc-plugin:3.10.1:aggregate (default-cli) on 
project tika: An error has occurred in Javadoc report generation: 
2024-10-15T13:41:51.8978543Z [ERROR] Exit code: 2
2024-10-15T13:41:51.8979146Z [ERROR] error: No source files for package 
org.apache.tika.io
{noformat}

Locally, the build works with Java 11 and 17, but there are no javadocs output 
in {{target/reports}} or anywhere in {{target}}.



was (Author: talli...@mitre.org):
Removing {{src/main/java}} causes ci/cd to fail.

Locally, the build works with Java 11 and 17, but there are no javadocs output 
in {{target/reports}} or anywhere in {{target}}.


> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. My memory is that there was a problem with the autogenerated 
> code in grpc. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889684#comment-17889684
 ] 

Tim Allison commented on TIKA-4318:
---

Removing {{src/main/java}} causes ci/cd to fail.

Locally, the build works with Java 11 and 17, but there are no javadocs output 
in {{target/reports}} or anywhere in {{target}}.


> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. My memory is that there was a problem with the autogenerated 
> code in grpc. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Reopened] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-4318:
---

> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. My memory is that there was a problem with the autogenerated 
> code in grpc. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889653#comment-17889653
 ] 

Tim Allison commented on TIKA-4318:
---

I made a very small modification in main just now, and {{javadoc:aggregate}} 
appears to be working again without turning off javadocs for grpc and without 
modifying any of the code in the grpc module.

> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. My memory is that there was a problem with the autogenerated 
> code in grpc. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4318:
--
Description: 
When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
aggregate}}. My memory is that there was a problem with the autogenerated code 
in grpc. 

Let's see if we can get this to work for the 3.0.0 release?


  was:
When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the 
autogenerated code in grpc. It looks like we can exclude packages in javadoc if 
we move the autogenerated code to its own package.

Let's see if we can get this to work for the 3.0.0 release?


> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc 
> aggregate}}. My memory is that there was a problem with the autogenerated 
> code in grpc. 
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Resolved] (TIKA-4318) Move auto-generated code to separate package

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4318.
---
Fix Version/s: 3.0.0
Assignee: Tim Allison
Resolution: Fixed

> Move auto-generated code to separate package
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the 
> autogenerated code in grpc. It looks like we can exclude packages in javadoc 
> if we move the autogenerated code to its own package.
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4318:
--
Summary: Fix javadoc aggregate in 3.x  (was: Move auto-generated code to 
separate package)

> Fix javadoc aggregate in 3.x
> 
>
> Key: TIKA-4318
> URL: https://issues.apache.org/jira/browse/TIKA-4318
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the 
> autogenerated code in grpc. It looks like we can exclude packages in javadoc 
> if we move the autogenerated code to its own package.
> Let's see if we can get this to work for the 3.0.0 release?





[jira] [Created] (TIKA-4318) Move auto-generated code to separate package

2024-10-15 Thread Tim Allison (Jira)
Tim Allison created TIKA-4318:
-

 Summary: Move auto-generated code to separate package
 Key: TIKA-4318
 URL: https://issues.apache.org/jira/browse/TIKA-4318
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the 
autogenerated code in grpc. It looks like we can exclude packages in javadoc if 
we move the autogenerated code to its own package.

Let's see if we can get this to work for the 3.0.0 release?





[jira] [Updated] (TIKA-4316) Goals for Tika 4.x

2024-10-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4316:
--
Description: 
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project?
4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin instead of the shade plugin
6) Use an auto-correcting linter instead of checkstyle

  was:
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project?
4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin instead of the shade plugin


> Goals for Tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here: 
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project?
> 4) Remove unsupported dl4j and sentiment analysis modules and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin instead of the shade plugin
> 6) Use an auto-correcting linter instead of checkstyle



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: 3.0.0 release?

2024-10-15 Thread Tim Allison
With help from Tilman, I think we're all set on the TextCSVParser
regression. Any other blockers on 3.0.0?

Any objections to the basic plan below?

On Thu, Oct 10, 2024 at 7:03 AM Tim Allison  wrote:

> Regression results are available:
> https://issues.apache.org/jira/browse/TIKA-4280?focusedCommentId=17888235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17888235
>
> I heard one off-list +1, and no -1s publicly or privately. Please let me
> know if anyone has any misgivings about the following plan.
>
> On Tue, Oct 8, 2024 at 10:05 AM Tim Allison  wrote:
>
>> All,
>>
>>   In looking at my schedule and thinking about the project more broadly,
>> I'm worried that moving pipes to its own module in Tika 3.x after we've had
>> two beta releases might be a large step. What would you think about this
>> timeline:
>>
>> Oct 2024 -- Release 3.0.0 (after regression tests and fixes and with
>> warnings about upcoming changes in tika-pipes in 4.x)
>> Oct 2024 -- move main branch to 4.x (and Java 17) and move tika-pipes to
>> its own module or create a standalone tika-pipes project
>> ??? -- start 4.0.0-BETA releases ASAP
>> April 2025 (6 months from 3.x release) -- end support for 2.x (and
>> thereby Java 8)
>> April 2025 (6 months from Oct 2024 or earlier???) -- release 4.0.0
>> Oct 2025 (one year from 3.x release) -- end support for 3.x (and thereby
>> Java 11)
>>
>> I realize that even dependency maintenance on three concurrent branches
>> will be burdensome. Perhaps we fall back to "update dependencies before a
>> release and before the regression tests" at least on the 2.x and 3.x
>> branches?
>>
>> What do you all think? Many thanks!
>>
>>  Best,
>>
>> Tim
>>
>


[jira] [Commented] (TIKA-4317) Abusive content on https://corpora.tika.apache.org/

2024-10-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889259#comment-17889259
 ] 

Tim Allison commented on TIKA-4317:
---

Deleted. Thank you for the report.

> Abusive content on https://corpora.tika.apache.org/
> ---
>
> Key: TIKA-4317
> URL: https://issues.apache.org/jira/browse/TIKA-4317
> Project: Tika
> Issue Type: Bug
> Components: site
> Reporter: Zoran Regvart
> Assignee: Tim Allison
> Priority: Major
>
> The Apache Camel team has been notified by Google of abusive content hosted 
> on https://corpora.tika.apache.org/, with the assumption that this is somehow 
> related to https://camel.apache.org. The scanning done by Google is against
> the whole apache.org domain, so the implication is that any abusive content
> found on any domain within apache.org will be attributed to, and affect,
> other domains within apache.org.
> Learn about abusive experiences here: 
> https://support.google.com/webtools/answer/7347327.
> Singled out page from Google report (content & possibly security warning):
> {code}https://corpora.tika.apache.org/base/docs/commoncrawl3/QK/QKKJTNDRIVLIPP7433IFC3EF3UVOSPIB{code}





[jira] [Resolved] (TIKA-4317) Abusive content on https://corpora.tika.apache.org/

2024-10-14 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4317.
---
Resolution: Fixed

> Abusive content on https://corpora.tika.apache.org/
> ---
>
> Key: TIKA-4317
> URL: https://issues.apache.org/jira/browse/TIKA-4317
> Project: Tika
> Issue Type: Bug
> Components: site
> Reporter: Zoran Regvart
> Assignee: Tim Allison
> Priority: Major
>
> The Apache Camel team has been notified by Google of abusive content hosted 
> on https://corpora.tika.apache.org/, with the assumption that this is somehow 
> related to https://camel.apache.org. The scanning done by Google is against
> the whole apache.org domain, so the implication is that any abusive content
> found on any domain within apache.org will be attributed to, and affect,
> other domains within apache.org.
> Learn about abusive experiences here: 
> https://support.google.com/webtools/answer/7347327.
> Singled out page from Google report (content & possibly security warning):
> {code}https://corpora.tika.apache.org/base/docs/commoncrawl3/QK/QKKJTNDRIVLIPP7433IFC3EF3UVOSPIB{code}





Re: Is there a way to publish to docker.io/apache ?

2024-10-11 Thread Tim Allison
Last I looked into this, I think infra granted it to three (?) people per
project. Maybe check with them and see if that still applies and then see
who has karma that might be willing to relinquish it?

On Fri, Oct 11, 2024 at 12:22 PM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:

> I have an image of Apache Tika gRPC on Docker Hub here:
>
> ndipiazza/tika-grpc:3.0.0-BETA2
>
> I am interested in publishing it as an official image at
> docker.io/apache/tika-grpc:3.0.0-BETA2
>
> Is it possible to get a Docker Hub account for my Apache credentials?
>


[jira] [Comment Edited] (TIKA-4316) Goals for Tika 4.x

2024-10-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888628#comment-17888628
 ] 

Tim Allison edited comment on TIKA-4316 at 10/11/24 12:29 PM:
--

All of this is still open for discussion, obviously -- we still haven't even
launched 3.x -- and I look forward to the discussion.


was (Author: talli...@mitre.org):
All of this is still open for discussion, and I look forward to discussion

> Goals for Tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here:
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high-level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project?
> 4) Remove unsupported dl4j and sentiment analysis modules and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin instead of the shade plugin





[jira] [Commented] (TIKA-4316) Goals for Tika 4.x

2024-10-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888628#comment-17888628
 ] 

Tim Allison commented on TIKA-4316:
---

All of this is still open for discussion, and I look forward to discussion

> Goals for Tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here:
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high-level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project?
> 4) Remove unsupported dl4j and sentiment analysis modules and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin instead of the shade plugin





[jira] [Updated] (TIKA-4316) Goals for Tika 4.x

2024-10-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4316:
--
Summary: Goals for Tika 4.x  (was: Goals for tika 4.x)

> Goals for Tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here:
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high-level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project?
> 4) Remove unsupported dl4j and sentiment analysis modules and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin instead of the shade plugin





[jira] [Updated] (TIKA-4316) Goals for tika 4.x

2024-10-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4316:
--
Description: 
I proposed a tentative roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high-level changes in 4.x

Some thoughts:
1) Require Java 17
2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
3) Move tika-pipes to a separate module. Consider moving non-trivial 
implementations of tika-pipes components to a separate project?
4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
pattern with the assembly plugin instead of the shade plugin

  was:
I proposed a roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high-level changes in 4.x


> Goals for tika 4.x
> --
>
> Key: TIKA-4316
> URL: https://issues.apache.org/jira/browse/TIKA-4316
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I proposed a tentative roadmap here:
> https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z
> Let's use this ticket to discuss some high-level changes in 4.x
> Some thoughts:
> 1) Require Java 17
> 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies
> 3) Move tika-pipes to a separate module. Consider moving non-trivial 
> implementations of tika-pipes components to a separate project?
> 4) Remove unsupported dl4j and sentiment analysis modules and...? 
> 5) Avoid fat jars where possible -- at least move tika-server to a lib/* 
> pattern with the assembly plugin instead of the shade plugin





[jira] [Created] (TIKA-4316) Goals for tika 4.x

2024-10-11 Thread Tim Allison (Jira)
Tim Allison created TIKA-4316:
-

Summary: Goals for tika 4.x
Key: TIKA-4316
URL: https://issues.apache.org/jira/browse/TIKA-4316
Project: Tika
Issue Type: Task
Reporter: Tim Allison


I proposed a roadmap here: 
https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z

Let's use this ticket to discuss some high-level changes in 4.x




