How are you extracting the text from the website [1] you are referring to?
Are you using Apache Nutch or another crawler? If so, first check whether
that crawler is giving you correctly encoded data before you invoke the
Solr index method.

[1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
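
One way to run that check is to scan the extracted text for U+FFFD, the Unicode replacement character, before anything is sent to Solr. The sketch below is in Python for brevity and is only an illustration: the function name `count_replacement_chars` and the sample string are hypothetical, and the ASCII decode merely simulates a fetch step that picked the wrong charset.

```python
REPLACEMENT_CHAR = "\ufffd"  # what a decoder emits for bytes it cannot map


def count_replacement_chars(text: str) -> int:
    """Return how many U+FFFD characters the already-decoded text contains."""
    return text.count(REPLACEMENT_CHAR)


# Hypothetical sample: two Chinese characters, six bytes in UTF-8.
raw_bytes = "中文".encode("utf-8")

# Decoding with the right charset keeps the text intact.
good = raw_bytes.decode("utf-8")

# Simulated wrong-charset decode: every non-ASCII byte becomes U+FFFD.
bad = raw_bytes.decode("ascii", errors="replace")

print(count_replacement_chars(good))
print(count_replacement_chars(bad))
```

If the count is non-zero at this stage, the damage happened before indexing, and no Solr or Tomcat setting can bring the original characters back.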

Setting URIEncoding on the Tomcat connector should resolve this problem.
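
To illustrate why the connector setting matters: Tomcat 7 decodes the query string as ISO-8859-1 unless URIEncoding says otherwise, while clients normally percent-encode UTF-8 bytes. The Python sketch below imitates both decodings with the standard library. The sample term is hypothetical, and the sketch only covers text that travels in the URL, such as query parameters.

```python
from urllib.parse import quote, unquote

query = "中文"  # hypothetical search term

# A client percent-encodes the UTF-8 bytes of the term.
encoded = quote(query, encoding="utf-8")

# Server-side decoding with the matching charset recovers the term.
as_utf8 = unquote(encoded, encoding="utf-8")

# Server-side decoding as ISO-8859-1 yields one wrong character per byte.
as_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(as_utf8 == query)
print(len(as_latin1))
```

The mismatch in the last step is the kind of corruption that URIEncoding="UTF-8" is meant to prevent.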




On Fri, Nov 1, 2013 at 10:50 AM, Chris <christu...@gmail.com> wrote:

> Hi Rajani,
>
> I followed the steps exactly as in
>
> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>
> However, when I send a query to this new instance in Tomcat, I again get
> the error:
>
>   <str name="fulltxt">Scheduled Groups Maintenance
> In preparation for the new release roll-out,���� Diigo groups won’t be
> accessible on Sept 28 (Mon) around midnight 0:00 PST for several
> hours.
> Stay tuned to say hello to Diigo V4 soon!
>
> Location of the text:
> http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>
> The same problem occurs at http://cn.nytimes.com/business/20130926/c26alibaba/
>
> All the text in the title comes out like this:
>
> ������������������������������������ - ���������������������
> ������������</str>
>     <arr name="text">
>       <str>������������������������������������ -
> ��������������������� ������������</str>
>     </arr>
>
>
> Can you please advise?
>
> Chris
>
>
>
>
> On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski <rajinima...@gmail.com> wrote:
>
> > Hi,
> >
> >    If you are using Apache Tomcat, make sure you are not missing the
> > configuration below:
> >
> >  <Connector port="port Number" protocol="HTTP/1.1"
> > connectionTimeout="20000"
> > redirectPort="8443" URIEncoding="UTF-8"/>
> >
> > I faced a similar issue with Chinese characters and resolved it with the
> > above config.
> >
> > Links for reference :
> >
> > http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> >
> > http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
> >
> >
> > Thanks
> >
> >
> >
> > On Tue, Oct 29, 2013 at 9:20 PM, Chris <christu...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I get characters like -
> > >
> > > ������������������ - CTA������������ -
> > >
> > > in the Solr index. I am adding Java beans to Solr with the addBean()
> > > method.
> > >
> > > This seems to be a character encoding issue. Any pointers on how to
> > > resolve this one?
> > >
> > > I have seen that this occurs mostly for Japanese and Chinese characters.
> > >
> >
>
