It sounds like you want an update request processor:
http://wiki.apache.org/solr/UpdateRequestProcessor
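For reference, a processor chain is wired into solrconfig.xml along these lines. The custom factory class name below is a hypothetical placeholder for your own encoding converter; the Log and Run processors are the standard tail of a chain:

```xml
<updateRequestProcessorChain name="normalize-encoding">
  <!-- hypothetical custom factory: detect the source encoding
       and convert field text to Unicode before it is indexed -->
  <processor class="com.example.EncodingNormalizerProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

You then point your update handler at the chain (e.g. via the `update.chain` parameter) so every incoming document passes through your processor before tokenization.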

But it also sounds like you should probably be normalizing the encoding before sending the data to Solr at all.
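As a rough illustration of that client-side approach, here is a minimal Python sketch: try a list of candidate encodings, decode with the first one that succeeds, and normalize to NFC so the index always stores one canonical Unicode form. The candidate list here is an assumption; you would substitute the legacy codecs your local sites actually use.

```python
import unicodedata

# Hypothetical candidate list; replace with the legacy encodings
# actually found on your local movie sites.
CANDIDATES = ["utf-8", "cp1252"]

def to_unicode(raw: bytes, candidates=CANDIDATES) -> str:
    """Decode raw document bytes with the first encoding that works,
    then normalize to NFC so the index stores one canonical form."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Last resort: keep the document rather than dropping it.
        text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)
```

Running this in the crawler, before the post to Solr, keeps both the index and the query side in plain Unicode, so no special handling is needed at search time.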

-- Jack Krupansky

-----Original Message----- From: Yewint Ko
Sent: Sunday, January 20, 2013 10:36 AM
To: solr-user@lucene.apache.org
Subject: Language Identification in index time

Hi all

I am very new to Solr and Nutch. Currently I have a requirement to develop a small search engine for local movie websites. Because many of our local websites use non-standard encoding systems, it has become necessary for us to develop an encoding identifier and converter for web crawling, indexing, and query processing. The idea is that we will identify the encoding used on a website, convert it (if necessary), and store the index in standard Unicode.

We have developed our own identifier and converter (a Solr SearchComponent) that can be used at query time to identify the encoding of the user query and convert it to match the index.

The problem I am having is that I don't know how to intercept the request at indexing time for identification and conversion. Is there something like a filter chain that can access the text before it is passed to the tokenizer, so that we can inspect the text and detect which encoding it is in?

Thanks
yewint