On Wed, Dec 7, 2011 at 00:41, Ilya Zavorin <izavo...@caci.com> wrote:
> I need to implement a "quick and dirty" or "poor man's" translation of a
> foreign-language document by looking up each word in a dictionary and
> replacing it with the English translation. So what I need is to tokenize
> the original foreign text into words and then access each word, look it
> up, and get its translation. However, if possible, I also need to
> preserve "non-words", i.e. stopwords, so that I can replicate them in
> the output stream without translating them. If the latter is not
> possible, then I just need to preserve the order of the original words
> so that their translations appear in the same order in the output.
>
> Can I accomplish this using Lucene components? I presume I'd have to
> start by creating an analyzer for the foreign language, but then what?
> How do I (i) tokenize, (ii) access words in the correct order, and
> (iii) also access non-words if possible?
>
> Thanks much
>
> Ilya Zavorin

You can always use something like StandardAnalyzer for the specific
language, with an empty stopword set, so that no words get dropped as
stopwords. A bit trickier is dealing with punctuation: depending on the
analyzer, you might be able to get it emitted as separate tokens.
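If relying on punctuation tokens turns out to be fragile, another option
is to lean on token offsets: each token carries its start/end offset into
the original text, so you can copy whatever sits between consecutive
tokens (whitespace, punctuation, anything the tokenizer skipped) to the
output verbatim. Roughly like this, as an untested sketch against the
Lucene 3.x API; the Map is just a stand-in for whatever dictionary
resource you actually have:

import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class PoorMansTranslator {

    // Word-by-word "translation": tokens found in the dictionary are
    // replaced; everything else, including the text between tokens,
    // is copied through unchanged and in the original order.
    public static String translate(String text, Map<String, String> dict)
            throws IOException {
        // Empty stopword set, so the analyzer drops no words.
        Analyzer analyzer = new StandardAnalyzer(
                Version.LUCENE_35, Collections.<String>emptySet());
        // The field name is irrelevant here; we only want the tokens.
        TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

        StringBuilder out = new StringBuilder();
        int last = 0;
        ts.reset();
        while (ts.incrementToken()) {
            // Copy the "non-words" (whitespace, punctuation) between
            // the previous token and this one verbatim.
            out.append(text, last, offset.startOffset());
            String word = term.toString();
            String translation = dict.get(word);
            out.append(translation != null ? translation : word);
            last = offset.endOffset();
        }
        ts.end();
        ts.close();
        // Copy whatever trails the last token.
        out.append(text.substring(last));
        return out.toString();
    }
}

One caveat: StandardAnalyzer lowercases its tokens, so the dictionary keys
would need to be lowercase too, and the original casing is lost in the
output. If that matters, assemble your own Analyzer from StandardTokenizer
and skip the LowerCaseFilter.

-- Avi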