[jira] [Commented] (TIKA-4685) Add a new charset detector for 4.x

Hudson (Jira) Sat, 07 Mar 2026 11:15:10 -0800


    [ https://issues.apache.org/jira/browse/TIKA-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063780#comment-18063780 ]


Hudson commented on TIKA-4685:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk17 #1243
(See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/1243/])
TIKA-4685 - add annotation processor for jdk >23 (#2679)
(github: [https://github.com/apache/tika/commit/b3023c47bcddc80e85e450f2801c5386859e74f8])
* (edit) tika-encoding-detectors/pom.xml


> Add a new charset detector for 4.x
> ----------------------------------
>
>                 Key: TIKA-4685
>                 URL: https://issues.apache.org/jira/browse/TIKA-4685
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> While I was building out the maxent model for the updated language detector, 
> I realized we had the resources (language files by language) and a maxent 
> model just sitting around and ready to build a new charset detector based on 
> byte ngrams.
> I have something working that appears to be quite good. We can replace both 
> the universal and icu4j detectors. There's a chance that the results are 
> hallucinated or that there's something surprising going on, but I think we 
> should merge this and see what happens on our regression set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
