Hi

>> tokenize the original foreign text into words

You need to identify the appropriate analyzer for the foreign language before indexing.
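Something along these lines might work as a starting point. This is an untested sketch against the Lucene 3.x API (per the StandardAnalyzer suggestion quoted below, with an empty stopword set so nothing is dropped); the WordByWordTranslator class name and the dictionary map are placeholders for illustration, not existing components. It uses token offsets to copy punctuation and other "non-words" through unchanged, so word order is preserved.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Collections;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class WordByWordTranslator {

        // Hypothetical dictionary: foreign word -> English translation.
        private final Map<String, String> dictionary;

        public WordByWordTranslator(Map<String, String> dictionary) {
            this.dictionary = dictionary;
        }

        public String translate(String foreignText) throws IOException {
            // Empty stopword set, so no words are treated as stopwords.
            StandardAnalyzer analyzer =
                    new StandardAnalyzer(Version.LUCENE_35, Collections.<String>emptySet());

            TokenStream ts = analyzer.tokenStream("dummy", new StringReader(foreignText));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);

            StringBuilder out = new StringBuilder();
            int lastEnd = 0;
            ts.reset();
            while (ts.incrementToken()) {
                // Copy whatever lies between tokens (spaces, punctuation) verbatim,
                // so the "non-words" survive untranslated and order is preserved.
                out.append(foreignText, lastEnd, offsets.startOffset());

                // Look the word up; fall back to the original word if it is unknown.
                String word = term.toString();
                String translation = dictionary.get(word);
                out.append(translation != null ? translation : word);

                lastEnd = offsets.endOffset();
            }
            ts.end();
            ts.close();

            // Copy any trailing text after the last token.
            out.append(foreignText.substring(lastEnd));
            return out.toString();
        }
    }

You would still need to swap in an analyzer suited to the foreign language if StandardAnalyzer does not tokenize it well.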
With regards
Karthik

On Wed, Dec 7, 2011 at 4:57 PM, Avi Rosenschein <arosensch...@gmail.com> wrote:
> On Wed, Dec 7, 2011 at 00:41, Ilya Zavorin <izavo...@caci.com> wrote:
>
> > I need to implement a "quick and dirty" or "poor man's" translation of a
> > foreign language document by looking up each word in a dictionary and
> > replacing it with the English translation. So what I need is to tokenize
> > the original foreign text into words and then access each word, look it up
> > and get its translation. However, if possible, I also need to preserve
> > "non-words", i.e. stopwords, so that I could replicate them in the output
> > stream without translating. If the latter is not possible then I just need
> > to preserve the order of the original words so that their translations have
> > the same order in the output.
> >
> > Can I accomplish this using Lucene components? I presume I'd have to start
> > by creating an analyzer for the foreign language, but then what? How do I
> > (i) tokenize, (ii) access words in the correct order, (iii) also access
> > non-words if possible?
>
> You can always use something like StandardAnalyzer for the specific
> language, with an empty stopword list (so that no words are treated as
> stopwords). A bit trickier might be dealing with punctuation - depending on
> the analyzer, you might be able to get these to parse as separate tokens.
>
> -- Avi
>
> > Thanks much
> >
> > Ilya Zavorin

--
N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094