Re: Problem with Cyrillic letters through Tika OCR indexing
I have the same problem. So it is probably the first case: how do we force the Tika parser to recognize Cyrillic characters as required? For me, it tries to recognize Russian text as if it were English transliteration, so the Russian text shows up in the result written using only the Latin alphabet.

On 10 Feb 2017 at 17:55, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:

> At what level exactly is this a problem? Are you looking for a way for
> Solr to pass the -l rus flag to Tika?
>
> Or are you saying that whatever OCR is used here is bad? In the second
> case, this is probably not a question for Solr or even Tika, but for
> whatever the underlying OCR library is.
>
> The stack is deep here; more precision is required.
>
> Good luck,
> Alex
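For the "first case" (getting the language hint through to the OCR engine), one possible route is a standalone tika-server, which, according to the Tika OCR documentation, accepts an `X-Tika-OCRLanguage` request header. The sketch below is only an illustration under assumptions not stated in this thread: that a tika-server is running on `localhost:9998`, that the header is honoured by your Tika version, and that Tesseract's Russian language data is installed. The function names are hypothetical. It does not cover Solr's embedded Tika, which is configured differently.

```python
import urllib.request

# Assumption: a standalone tika-server is listening here (9998 is its usual default port).
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(image_bytes, lang="rus", url=TIKA_URL):
    """Build a PUT request asking tika-server to OCR the image with the given language."""
    req = urllib.request.Request(url, data=image_bytes, method="PUT")
    # Header name as given in the Tika OCR documentation; verify it for your Tika version.
    req.add_header("X-Tika-OCRLanguage", lang)
    req.add_header("Accept", "text/plain")
    return req


def ocr_with_tika(image_path, lang="rus"):
    """Send the image to tika-server and return the extracted text."""
    with open(image_path, "rb") as f:
        req = build_ocr_request(f.read(), lang)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling `ocr_with_tika("scan.png")` (a hypothetical file name) would then return whatever text the server extracts with the Russian model selected.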
Re: Problem with Cyrillic letters through Tika OCR indexing
At what level exactly is this a problem? Are you looking for a way for Solr to pass the -l rus flag to Tika?

Or are you saying that whatever OCR is used here is bad? In the second case, this is probably not a question for Solr or even Tika, but for whatever the underlying OCR library is.

The stack is deep here; more precision is required.

Good luck,
Alex

On 10 Feb 2017 2:52 AM, "Абрашин, Игорь Олегович" wrote:

> Hello everyone,
>
> I have encountered the error mentioned in the title.
> The original image is attached, and the recognized text is below:
>
> 3ApaBCTyI7ITe 9| )KVIBy xopomo
>
> Has anyone faced something similar?
> I should mention that tesseract recognizes it more correctly with the
> -l rus option.
>
> Thanks in advance!
Problem with Cyrillic letters through Tika OCR indexing
Hello everyone,

I have encountered the error mentioned in the title. The original image is attached, and the recognized text is below:

3ApaBCTyI7ITe 9| )KVIBy xopomo

Has anyone faced something similar? I should mention that tesseract recognizes it more correctly with the -l rus option.

Thanks in advance!

Best regards,
Игорь Абрашин
ООО «НОВАТЭК НТЦ»
Work phone: +7 (3452) 680-386
Internal corporate ext.: 22-586
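For reference, the direct tesseract call mentioned above can be scripted as in the following minimal sketch. It assumes the `tesseract` binary is on the PATH and that the Russian language data (`rus`) is installed; the helper names and the file name are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, lang="rus"):
    """Return the tesseract command line that prints recognized text to stdout."""
    # "stdout" as the output base makes tesseract write the text to standard output;
    # -l selects the language model (several can be combined, e.g. "rus+eng").
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="rus"):
    """Run tesseract on the image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout.decode("utf-8")
```

Comparing `ocr_image("scan.png", lang="eng")` with `ocr_image("scan.png", lang="rus")` on the same file is a quick way to check whether the Latin-only output comes from the language setting rather than from Tika or Solr.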