It sounds like you want an update request processor:
http://wiki.apache.org/solr/UpdateRequestProcessor
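For reference, a processor chain is wired into solrconfig.xml along these lines. The custom factory class name below is a hypothetical placeholder for your own encoding converter; the Log and Run processors are the standard tail of a chain:

```xml
<updateRequestProcessorChain name="normalize-encoding">
  <!-- hypothetical custom factory: detect the source encoding
       and convert field text to Unicode before it is indexed -->
  <processor class="com.example.EncodingNormalizerProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

You then point your update handler at the chain (e.g. via the `update.chain` parameter) so every incoming document passes through your processor before tokenization.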

But it also sounds like you should probably be normalizing the encoding before sending the data to Solr at all.
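As a rough illustration of that client-side approach, here is a minimal Python sketch: try a list of candidate encodings, decode with the first one that succeeds, and normalize to NFC so the index always stores one canonical Unicode form. The candidate list here is an assumption; you would substitute the legacy codecs your local sites actually use.

```python
import unicodedata

# Hypothetical candidate list; replace with the legacy encodings
# actually found on your local movie sites.
CANDIDATES = ["utf-8", "cp1252"]

def to_unicode(raw: bytes, candidates=CANDIDATES) -> str:
    """Decode raw document bytes with the first encoding that works,
    then normalize to NFC so the index stores one canonical form."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Last resort: keep the document rather than dropping it.
        text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)
```

Running this in the crawler, before the post to Solr, keeps both the index and the query side in plain Unicode, so no special handling is needed at search time.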

-- Jack Krupansky

-----Original Message----- From: Yewint Ko
Sent: Sunday, January 20, 2013 10:36 AM
To: solr-user@lucene.apache.org
Subject: Language Identification in index time

Hi all

I am very new to Solr and Nutch. Currently I have a requirement to develop a small search engine for local movie websites. Because many of our local websites use non-standard encoding systems, it has become necessary for us to develop an encoding identifier and converter for web crawling, indexing, and query processing. The idea is that we will identify the encoding used on a website, convert it (if necessary), and store the index in standard Unicode.

We have developed our own identifier and converter (a Solr SearchComponent) that can be used at query time to identify the encoding of the user query and convert it to match the index.

The problem I am having is that I don't know how to intercept the request at indexing time for identification and conversion. Is there something like a filter chain that can access the text before it is passed to the tokenizer, so that we can inspect the text and detect which encoding it is in?

Thanks
yewint