Two options:
1) Send the extracted text to the /language endpoint
(https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-LanguageResource).
2) If you are using the /rmeta endpoint or the JSON output from the /tika
endpoint, you can get a language id from a slightly different language id
mechanism via tika-eval. Add the tika-eval jar to your classpath (see the
section titled "Integration with tika-eval" at
https://cwiki.apache.org/confluence/display/TIKA/TikaServer).
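
A rough sketch of both options in Python, in case it helps. It assumes
tika-server is running locally on the default port 9998 and that the
/language/stream resource described on the wiki page above accepts a PUT of
UTF-8 text. The function names are made up for illustration, and the
"tika-eval:lang" metadata key in the second helper is an assumption: check
the keys that your own /rmeta output actually contains once tika-eval is on
the classpath.

```python
import json
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_SERVER = "http://localhost:9998"


def build_language_request(text, base_url=TIKA_SERVER):
    """Option 1: build (without sending) a PUT of UTF-8 text to /language/stream."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/language/stream",
        data=text.encode("utf-8"),
        method="PUT",
    )


def detect_language(text, base_url=TIKA_SERVER):
    """Send the extracted text to a running server and return the response body."""
    with urllib.request.urlopen(build_language_request(text, base_url)) as resp:
        return resp.read().decode("utf-8").strip()


def language_from_rmeta(rmeta_json, key="tika-eval:lang"):
    """Option 2: read a language key from /rmeta JSON (a list of metadata objects).

    The default key name is an assumption; pass whatever key your output uses.
    Returns one value per metadata object, or None where the key is absent.
    """
    records = json.loads(rmeta_json)
    return [record.get(key) for record in records]
```

With the server up, detect_language(extracted_text) covers option 1, and
language_from_rmeta(response_body) covers option 2 for the JSON that comes
back from /rmeta/text.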

On Mon, Apr 17, 2023 at 8:16 AM Chetan Bikire <chetab...@gmail.com> wrote:

> I am using the Tika 2.7 server standard runnable jar package, which I
> believe has a built-in language detection feature. Do we need to do any
> other configuration or install any other extension in order to get
> language detection as mentioned below?
>
> [image: image.png]
>
> Please assist.
> Thanks
>
> On Fri, Apr 14, 2023, 22:05 Chetan Bikire <chetab...@gmail.com> wrote:
>
>> I didn't find any metadata for language either, but I thought the Tika
>> language detector extension would be able to provide it.
>>
>> org.apache.tika.language.detect.LanguageDetector
>>
>> On Wed, Apr 12, 2023, 22:38 Tim Allison <talli...@apache.org> wrote:
>>
>>> I'm not seeing language hints in the document.xml within the docx nor
>>> in the metadata.  Do you know where it might be stored inside the
>>> docx?
>>>
>>> On Wed, Apr 12, 2023 at 1:01 PM Chetan Bikire <chetab...@gmail.com>
>>> wrote:
>>> >
>>> > I am calling Tika through the rmeta/text endpoint of Tika server 2.7.
>>> > Yes, by language detection I mean any metadata field that shows the
>>> > language in which the document is written.
>>> > For example, the attached document contains Spanish content, so we
>>> > would expect the metadata Content-Language:"es".
>>> >
>>> >
>>> >
>>> > On Wed, Apr 12, 2023 at 8:32 PM Tim Allison <talli...@apache.org>
>>> wrote:
>>> >>
>>> >> How are you calling Tika?  By "language", do you mean language
>>> >> detection on the extracted text or an internal metadata flag that says
>>> >> "I'm X language"?
>>> >>
>>> >> On Wed, Apr 12, 2023 at 10:48 AM Chetan Bikire <chetab...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > After parsing, Tika does not return the language as part of the
>>> >> > parsing result for some documents, such as .docx and .msg files.
>>> >> > The example document is below.
>>> >> > Please assist.
>>>
>>
