Re: DIH import from MySQL results in garbage text for special chars

Pranav Prakash Thu, 20 Sep 2012 22:50:16 -0700

I am seeing the garbage text in browser, Luke Index Toolbox and everywhere
it is the same. My servlet container is Jetty which is the out-of-box one.
Many other special chars are getting indexed and stored properly, only few
characters causes pain.


*Pranav Prakash*

"temet nosce"



On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson <erickerick...@gmail.com>wrote:

> Is your _browser_ set to handle the appropriate character set? Or whatever
> you're using to inspect your data? How about your servlet container?
>
>
>
> Best
> Erick
>
> On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash <pra...@gmail.com> wrote:
> > Hi Folks,
> >
> > I am attempting to import documents to Solr from MySQL using DIH. One of
> > the field contains the text - “Future of Mobile Value Added Services
> (VAS)
> > in Australia” .Notice the character “ and ”.
> >
> > When I am importing, it gets stored as - â€œFuture of Mobile Value Added
> > Services (VAS) in Australiaâ€�.
> >
> > The datasource config clearly mentions use of UTF-8 as follows:
> >
> >   <dataSource type="JdbcDataSource"
> >     driver="com.mysql.jdbc.Driver"
> >     url="jdbc:mysql://localhost/ohapp_devel"
> >     user="username"
> >     useUnicode="true"
> >     characterEncoding="UTF-8"
> >     password="password"
> >     zeroDateTimeBehavior="convertToNull"
> >     name="app" />
> >
> >
> > A plain SQL Select statement on the MySQL Console gives appropriate
> text. I
> > even tried using following scriptTransformer to get rid of this char, but
> > it was of no particular use in my case.
> >
> > function gsub(source, pattern, replacement) {
> >       var match, result;
> >       if (!((pattern != null) && (replacement != null))) {
> >         return source;
> >       }
> >       result = '';
> >       while (source.length > 0) {
> >         if ((match = source.match(pattern))) {
> >           result += source.slice(0, match.index);
> >           result += replacement;
> >           source = source.slice(match.index + match[0].length);
> >         } else {
> >           result += source;
> >           source = '';
> >         }
> >       }
> >       return result;
> >     }
> >
> >     function fixQuotes(c){
> >       c = gsub(c, /\342\200(?:\234|\235)/,'"');
> >       c = gsub(c, /\342\200(?:\230|\231)/,"'");
> >       c = gsub(c, /\342\200\223/,"-");
> >       c = gsub(c, /\342\200\246/,"...");
> >       c = gsub(c, /\303\242\342\202\254\342\204\242/,"'");
> >       c = gsub(c, /\303\242\342\202\254\302\235/,'"');
> >       c = gsub(c, /\303\242\342\202\254\305\223/,'"');
> >       c = gsub(c, /\303\242\342\202\254"/,'-');
> >       c = gsub(c, /\342\202\254\313\234/,'"');
> >       c = gsub(c, /“/, '"');
> >       return c;
> >     }
> >
> > function cleanFields(row){
> >       var fieldsToClean = ['title', 'description'];
> >       for(i =0; i< fieldsToClean.length; i++){
> >         var old_text = String(row.get(fieldsToClean[i]));
> >         row.put(fieldsToClean[i], fixQuotes(old_text) );
> >       }
> >       return row;
> >     }
> >
> > My understanding goes that this must be a very common problem. It also
> > occurs with human names which have these chars. What is an appropriate
> way
> > to get the appropriate text indexed and searchable? The fieldtype where
> > this is stored goes as follows
> >
> >   <fieldType name="text_commongrams" class="solr.TextField">
> >     <analyzer>
> >       <charFilter class="solr.HTMLStripCharFilterFactory" />
> >       <tokenizer class="solr.StandardTokenizerFactory" />
> >       <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
> >       <filter class="solr.TrimFilterFactory" />
> >       <filter class="solr.LowerCaseFilterFactory" />
> >       <filter class="solr.SnowballPorterFilterFactory" language="English"
> > protected="protwords.txt"/>
> >     <filter class="solr.SynonymFilterFactory"
> >       synonyms="synonyms.txt"
> >       ignoreCase="true"
> >       expand="true" />
> >     <filter class="solr.CommonGramsFilterFactory"
> >       words="stopwords_en.txt"
> >       ignoreCase="true" />
> >     <filter class="solr.StopFilterFactory"
> >       words="stopwords_en.txt"
> >       ignoreCase="true" />
> >     <filter class="solr.WordDelimiterFilterFactory"
> >       generateWordParts="1"
> >       generateNumberParts="1"
> >       catenateWords="1"
> >       catenateNumbers="1"
> >       catenateAll="0"
> >       preserveOriginal="1" />
> >   </analyzer>
> > </fieldType>
> >
> >
> > *Pranav Prakash*
> >
> > "temet nosce"
>

Re: DIH import from MySQL results in garbage text for special chars

Reply via email to