Two options:

1) Send the extracted text to the /language endpoint ( https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-LanguageResource ).

2) If you are using the /rmeta endpoint or the JSON output from the /tika endpoint, you can get language ID from a slightly different language-detection mechanism via tika-eval. Add the tika-eval jar to your classpath (see the section titled "Integration with tika-eval" at https://cwiki.apache.org/confluence/display/TIKA/TikaServer ).
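For option 1, here is a rough sketch of what the call might look like from Python, using only the standard library. It assumes tika-server is running locally on the default port 9998 and that the /language/stream resource accepts the text as a PUT body and returns the detected language code as plain text, as described on the wiki page. The helper names (build_language_request, detect_language) are made up for this example, and the Content-Type header is my choice rather than a documented requirement.

```python
import urllib.request

# Assumption: tika-server started with defaults on this machine.
TIKA_SERVER = "http://localhost:9998"


def build_language_request(text, base_url=TIKA_SERVER):
    """Build (but do not send) a PUT request for the /language/stream resource."""
    url = base_url.rstrip("/") + "/language/stream"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),  # the already-extracted text, not the original file
        method="PUT",
        headers={"Content-Type": "text/plain; charset=UTF-8"},
    )


def detect_language(text, base_url=TIKA_SERVER, timeout=30):
    """Send the extracted text to tika-server and return the response body,
    which is expected to be a short language code."""
    request = build_language_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

With the server up, something like `detect_language(extracted_text)` on the text you got back from /rmeta/text should give you the code you can then store as your own Content-Language value.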
On Mon, Apr 17, 2023 at 8:16 AM Chetan Bikire <chetab...@gmail.com> wrote:

> As I am using the tika 2.7 server standard runnable jar package
> and which has a built-in language detection feature I believe, do we need
> to do any other configuration or need to install any other extension in
> order to achieve language detection as mentioned below.
>
> [image: image.png]
>
> Please assist.
> Thanks
>
> On Fri, Apr 14, 2023, 22:05 Chetan Bikire <chetab...@gmail.com> wrote:
>
>> I too didn't find any metadata for language, but thought using tika
>> language detector extension can be able to get it.
>>
>> org.apache.tika.language.detect.LanguageDetector
>>
>> On Wed, Apr 12, 2023, 22:38 Tim Allison <talli...@apache.org> wrote:
>>
>>> I'm not seeing language hints in the document.xml within the docx nor
>>> in the metadata. Do you know where it might be stored inside the
>>> docx?
>>>
>>> On Wed, Apr 12, 2023 at 1:01 PM Chetan Bikire <chetab...@gmail.com> wrote:
>>> >
>>> > I am calling tika using rmeta/text endpoint by running tika server 2.7.
>>> > Yes, language detection means any metadata field which shows language in which document is written.
>>> > like for example- in our case attached document contains spanish content in it then metadata Content-Language:"es"
>>> >
>>> > On Wed, Apr 12, 2023 at 8:32 PM Tim Allison <talli...@apache.org> wrote:
>>> >>
>>> >> How are you calling Tika? By "language", do you mean language
>>> >> detection on the extracted text or an internal metadata flag that says
>>> >> "I'm X language"?
>>> >>
>>> >> On Wed, Apr 12, 2023 at 10:48 AM Chetan Bikire <chetab...@gmail.com> wrote:
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > After parsing documents tika does not return language as part parsing result for some of the documents like docx,.msg files.
>>> >> > Below is the example document.
>>> >> > please assist.
>>> >>