[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasad Deshpande updated SOLR-2346:
-----------------------------------

    Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale.  (was: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine is booted in Japanese Locale.)

> Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2346
>                 URL: https://issues.apache.org/jira/browse/SOLR-2346
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1
>         Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale.
>            Reporter: Prasad Deshpande
>            Priority: Critical
>         Attachments: sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
>
>
> I am able to successfully index/search non-English files (like Hebrew, Japanese) that were encoded in UTF-8. However, when I tried to index data that was encoded in a local encoding like Big5 for Japanese, I could not see the desired results. The contents after indexing looked garbled for the Big5-encoded document when I searched for all indexed documents. When I index the attached non-UTF-8 file, it is indexed in the following way:
> - <result name="response" numFound="1" start="0">
> - <doc>
> - <arr name="attr_content">
>   <str>�� ������</str>
>   </arr>
> - <arr name="attr_content_encoding">
>   <str>Big5</str>
>   </arr>
> - <arr name="attr_content_language">
>   <str>zh</str>
>   </arr>
> - <arr name="attr_language">
>   <str>zh</str>
>   </arr>
> - <arr name="attr_stream_size">
>   <str>17</str>
>   </arr>
> - <arr name="content_type">
>   <str>text/plain</str>
>   </arr>
>   <str name="id">doc2</str>
>   </doc>
>   </result>
>   </response>
> Here you said the file is indexed as UTF-8; however, it seems that the non-UTF-8 file gets indexed in Big5 encoding.
> Here I tried fetching the indexed data stream as Big5 and converting it to UTF-8:
> String id = (String) resulDocument.getFirstValue("attr_content");
> byte[] bytearray = id.getBytes("Big5");
> String utf8String = new String(bytearray, "UTF-8");
> It does not give the expected results.
> When I index the UTF-8 file, it is indexed as follows:
> - <doc>
> - <arr name="attr_content">
>   <str>マイ ネットワーク</str>
>   </arr>
> - <arr name="attr_content_encoding">
>   <str>UTF-8</str>
>   </arr>
> - <arr name="attr_stream_content_type">
>   <str>text/plain</str>
>   </arr>
> - <arr name="attr_stream_name">
>   <str>sample_jap_unicode.txt</str>
>   </arr>
> - <arr name="attr_stream_size">
>   <str>28</str>
>   </arr>
> - <arr name="attr_stream_source_info">
>   <str>myfile</str>
>   </arr>
> - <arr name="content_type">
>   <str>text/plain</str>
>   </arr>
>   <str name="id">doc2</str>
>   </doc>
> So, I can index and search UTF-8 data.
> For more reference, below is the discussion with Yonik.
> Please find attached the TXT file which I was using to index and search.
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F "myfile=@sample_jap_non_UTF-8"
> One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8.
> Here's one way to actually tell solr what the encoding of the text you are sending is:
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
> Now the problem appears that for some reason, this doesn't work...
> Could you open a JIRA issue and attach your two test files?
> -Yonik
> http://lucidimagination.com
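
A note on the conversion attempt quoted in the description (getBytes("Big5") followed by new String(..., "UTF-8")): once bytes have been decoded with the wrong charset, malformed sequences are typically replaced with substitute characters and the original bytes are gone, so re-encoding the garbled String is very unlikely to bring the text back. The Java sketch below illustrates that under stated assumptions. It assumes, purely for illustration, that the sample bytes are Shift_JIS; the report does not confirm the file's real encoding, and the guess rests only on the sample text occupying 17 bytes in Shift_JIS, the same as the reported attr_stream_size. The class and method names are hypothetical, and the code needs a JDK that ships the Shift_JIS and Big5 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Solr, SolrJ, or Tika.
public class CharsetRoundTrip {

    // Decode raw bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // The pattern from the report: re-encode an already decoded String
    // and reinterpret the resulting bytes in another charset.
    static String reinterpret(String decoded, Charset from, Charset to) {
        return new String(decoded.getBytes(from), to);
    }

    public static void main(String[] args) {
        // The sample text from the UTF-8 result, written with Unicode
        // escapes so the source file encoding does not matter.
        String original = "\u30DE\u30A4 \u30CD\u30C3\u30C8\u30EF\u30FC\u30AF";

        Charset sjis = Charset.forName("Shift_JIS"); // ASSUMED real encoding
        Charset big5 = Charset.forName("Big5");      // charset reported in attr_content_encoding

        // Stand-in for the bytes of the non-UTF-8 attachment.
        byte[] raw = original.getBytes(sjis);

        // Decoding the original bytes with their real charset recovers the text.
        String good = decode(raw, sjis);

        // Decoding as Big5 first and trying to repair the String afterwards
        // cannot recover it: the wrong decode already discarded information.
        String garbled = decode(raw, big5);
        String repaired = reinterpret(garbled, big5, StandardCharsets.UTF_8);

        System.out.println("bytes in sample:      " + raw.length);
        System.out.println("correct decode works: " + good.equals(original));
        System.out.println("repair attempt works: " + repaired.equals(original));
    }
}
```

If this reading is right, the fix has to happen before the text reaches the index, by getting the extraction step to use the correct charset (which is what the Content-type suggestion quoted above aims at), rather than by post-processing the stored attr_content value.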