Hi

>> tokenize the original foreign text into words

You need to identify the appropriate analyzer for the foreign language before indexing.
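Something along these lines might work as a starting point. This is an untested sketch against the Lucene 3.x API (per the StandardAnalyzer suggestion quoted below, with an empty stopword set so nothing is dropped); the WordByWordTranslator class name and the dictionary map are placeholders for illustration, not existing components. It uses token offsets to copy punctuation and other "non-words" through unchanged, so word order is preserved.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Collections;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class WordByWordTranslator {

        // Hypothetical dictionary: foreign word -> English translation.
        private final Map<String, String> dictionary;

        public WordByWordTranslator(Map<String, String> dictionary) {
            this.dictionary = dictionary;
        }

        public String translate(String foreignText) throws IOException {
            // Empty stopword set, so no words are treated as stopwords.
            StandardAnalyzer analyzer =
                    new StandardAnalyzer(Version.LUCENE_35, Collections.<String>emptySet());

            TokenStream ts = analyzer.tokenStream("dummy", new StringReader(foreignText));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);

            StringBuilder out = new StringBuilder();
            int lastEnd = 0;
            ts.reset();
            while (ts.incrementToken()) {
                // Copy whatever lies between tokens (spaces, punctuation) verbatim,
                // so the "non-words" survive untranslated and order is preserved.
                out.append(foreignText, lastEnd, offsets.startOffset());

                // Look the word up; fall back to the original word if it is unknown.
                String word = term.toString();
                String translation = dictionary.get(word);
                out.append(translation != null ? translation : word);

                lastEnd = offsets.endOffset();
            }
            ts.end();
            ts.close();

            // Copy any trailing text after the last token.
            out.append(foreignText.substring(lastEnd));
            return out.toString();
        }
    }

You would still need to swap in an analyzer suited to the foreign language if StandardAnalyzer does not tokenize it well.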
With regards
Karthik

On Wed, Dec 7, 2011 at 4:57 PM, Avi Rosenschein <arosensch...@gmail.com> wrote:
> On Wed, Dec 7, 2011 at 00:41, Ilya Zavorin <izavo...@caci.com> wrote:
>
> > I need to implement a "quick and dirty" or "poor man's" translation of a
> > foreign language document by looking up each word in a dictionary and
> > replacing it with the English translation. So what I need is to tokenize
> > the original foreign text into words and then access each word, look it up
> > and get its translation. However, if possible, I also need to preserve
> > "non-words", i.e. stopwords, so that I could replicate them in the output
> > stream without translating. If the latter is not possible then I just need
> > to preserve the order of the original words so that their translations have
> > the same order in the output.
> >
> > Can I accomplish this using Lucene components? I presume I'd have to start
> > by creating an analyzer for the foreign language, but then what? How do I
> > (i) tokenize, (ii) access words in the correct order, (iii) also access
> > non-words if possible?
>
> You can always use something like StandardAnalyzer for the specific
> language, with an empty stopword list (so that no words are treated as
> stopwords). A bit trickier might be dealing with punctuation - depending on
> the analyzer, you might be able to get these to parse as separate tokens.
>
> -- Avi
>
> > Thanks much
> >
> > Ilya Zavorin

--
N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094