Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-11 Thread KK
Thank you very much Yonik. I downloaded the latest Solr build, pulled out the WordDelimiterFilter, and used it with the same options as Solr's default configuration, and it worked like a charm. Thanks to Robert also. Thanks, KK On Tue, Jun 9, 2009 at 7:01 PM, Yonik Seeley wrote: > I just cut'n'pasted your wor

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-11 Thread KK
Note: I request Solr users to go through this mail and let me know their ideas. Thanks Yonik, you rightly pointed it out. That clearly says that the way I'm trying to mimic the default behaviour of Solr indexing/searching in Lucene is wrong, right? I downloaded the latest version of solr nightly on m

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-09 Thread Yonik Seeley
I just cut'n'pasted your word into Solr... it worked fine (it didn't split the word). Make sure you're using the latest from the trunk version of Solr... this was fixed since 1.3 http://localhost:8983/solr/select?q=साल&debugQuery=true [...] साल साल text:साल text:साल -Yonik On Tue, Jun

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-09 Thread KK
> Thank you all. To be frank I was using Solr in the beginning half a month ago.

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-08 Thread Robert Muir
> fly. Though they have a RESTful method for the same, but it was not working.

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-07 Thread KK
> couple of days stuck at that made me think of lucene and I switched to it.

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-06 Thread Robert Muir
> to Window$:Linux, it's my view only, though]. Coming back to the point as

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-06 Thread KK
e job for me, handling a mix of both english and non-english content. Muir, can you give me a bit detail descripti

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread KK
> solr that will do the job for me, handling a mix of both english and non-english

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread Robert Muir
> the WordDelimiterFilter to do my job. On a side note, I was thinking of writing a simple analyzer that will do the

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread Robert Muir
> with. As I know it's in UCN unicode, something like \u0021\u0012\u34ae\u0031 [just a sample]

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread KK
> Now to get all this, #1. I need some sort of way which will let me know if the content is english or not. If not english just add the tokens to
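One rough way to approach the "is this english or not" check in #1 is to look at the script of each token rather than the language. The sketch below is plain Python using only the standard library, not Lucene code, and the helper name is_latin_token is hypothetical. It assumes that "english" can be approximated by "every letter is a Latin-script letter", which would also match German or French text.

```python
import unicodedata


def is_latin_token(token: str) -> bool:
    """Return True if the token has letters and all of them are Latin-script.

    This checks the script only; it cannot tell English from other
    languages written in the Latin alphabet.
    """
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        # Tokens made only of digits or symbols are not treated as Latin text.
        return False
    # Unicode names of Latin letters start with "LATIN", for example
    # "LATIN SMALL LETTER A"; Devanagari letters start with "DEVANAGARI".
    return all(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)
```

With this check, is_latin_token("Lucene") is True while is_latin_token("साल") is False, so only the first kind of token would be passed on to a stemmer.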

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread Robert Muir
y other content that uses the same script as english other than those \u1234 things for my indian language content. Any smart hack/trick for the same?

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread KK
ally add the newly created document to the index. I would like someone to guide me in this direction. I'm pretty people

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread Robert Muir
to some tutorials for the same. Else help me out writing a custom analyzer only if that's not going to be too complex. LOL, I'm a new user to lucene and know basics of Java coding.

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-05 Thread KK
> too complex. LOL, I'm a new user to lucene and know basics of Java coding. Thank you very much. --KK. On Thu, Jun 4, 2009 at 5

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
> On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir wrote: > yes this is true. for starters KK, might be good to start up solr and look at

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
> yes this is true. for starters KK, might be good to start up solr and look at http://localhost:8983/solr/admin/analysis.jsp?highlight=on > if you want to stick with lucene, the WordDelimiterFilter is the piec

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
r text, mainly for punctuation but also for format characters such as ZWJ/ZWNJ. On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler wrote: > You can also re-use the solr analyzers, as far as I found out. There is an

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
Uwe, what KK needs here is 'proper unicode handling'. since the latest WordDelimiterFilter has pretty good handling of unicode categories, combining this with WhitespaceTokenizer effectively gives you a pretty good solution for unicode tokenization. KK doesn't need detection of anything, the porte
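To illustrate why Unicode categories matter here, the following is a minimal sketch in plain Python, not the WordDelimiterFilter implementation. The function names is_word_char and split_on_delimiters are hypothetical. It assumes that letters (L*), combining marks (M*) and digits (N*) belong inside a word and that everything else is a delimiter; treating format characters such as ZWJ/ZWNJ as delimiters is a simplification of this sketch only.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Letters (L*), marks (M*) and numbers (N*) stay inside a word."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def split_on_delimiters(token: str) -> list:
    """Split one whitespace-delimited token on non-word characters."""
    parts, current = [], []
    for ch in token:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A delimiter ends the current word part.
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    return parts
```

The middle character of साल is a vowel sign with category Mc, not a letter. A splitter that accepted only letters would cut the word in two there, while this version keeps it whole and still splits "wi-fi" into "wi" and "fi".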

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
here is an issue in JIRA/discussion on java-dev to merge them. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
Uwe, thanks for your lightning-fast response :-). I'm looking into that; let me see how far I can go... Also, I request Muir to point me to the exact analyzer he mentioned in the previous mail. Thanks, KK On Thu, Jun 4, 2009 at 6:10 PM, Uwe Schindler wrote: > I request Uwe to give me some

RE: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Uwe Schindler
> I request Uwe to give me some more ideas on using the analyzers from solr that will do the job for me, handling a mix of both english and non-english content. Look here: http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html As you see, the Solr analyzers are just

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread KK
> eMail: u...@thetaphi.de > -Original Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Thursday, June 04, 2009 1:18 PM > To: java-user@lucene.apache.org > Subject: Re: How to sup

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
hetaphi.de > eMail: u...@thetaphi.de > -Original Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Thursday, June 04, 2009 1:18 PM > To: java-user@lucene.apache.org > Subject: Re: How to support stemming and case folding for en

RE: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Uwe Schindler
to:rcm...@gmail.com] > Sent: Thursday, June 04, 2009 1:18 PM > To: java-user@lucene.apache.org > Subject: Re: How to support stemming and case folding for english content > mixed with non-english content? > > KK, ok, so you only really want to stem the english. This is good. >

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-04 Thread Robert Muir
KK, ok, so you only really want to stem the english. This is good. Is it possible for you to consider using solr? solr's default analyzer for type 'text' will be good for your case. it will do the following 1. tokenize on whitespace 2. handle both indian language and english punctuation 3. lowerca
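The first steps listed above (whitespace tokenization, punctuation handling, lowercasing), followed by English-only stemming, can be sketched roughly as follows. This is plain Python and not Solr or Lucene code. The function names are hypothetical, and toy_stem is a deliberately crude stand-in for a real Porter stemmer: it only strips a few suffixes from ASCII-letter tokens, which is an assumption made to keep the example self-contained.

```python
import unicodedata

# Hypothetical suffix list; a real Porter stemmer is far more careful.
_SUFFIXES = ("ing", "ed", "es", "s")


def strip_punctuation(token: str) -> str:
    """Drop leading and trailing characters in Unicode punctuation categories (P*)."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def toy_stem(token: str) -> str:
    """Strip one common suffix, but only from pure ASCII-letter tokens."""
    if not (token.isascii() and token.isalpha()):
        return token  # non-English scripts pass through untouched
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list:
    """Whitespace-tokenize, trim punctuation, lowercase, then stem."""
    tokens = []
    for raw in text.split():
        token = strip_punctuation(raw).lower()
        if token:
            tokens.append(toy_stem(token))
    return tokens
```

For example, analyze("Indexed Documents, साल।") returns ["index", "document", "साल"]: the comma and the Devanagari danda are trimmed because both are punctuation, lowercasing leaves the Devanagari token unchanged, and the stemming step skips it because it is not made of ASCII letters.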

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-03 Thread KK
Muir, thanks for your response. I'm indexing indian language web pages which have a decent amount of english content mixed in. For the time being I'm not going to use any stemmers as we don't have standard stemmers for indian languages. So what I want to do is like this, Say I've a web

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-03 Thread Robert Muir
KK, is all of your latin script text actually english? Is there stuff like german or french mixed in? And for your non-english content (your examples have been indian writing systems), is it generally true that if you had devanagari, you can assume it's hindi? or is there stuff like marathi mixed i

How to support stemming and case folding for english content mixed with non-english content?

2009-06-03 Thread KK
Hi All, I'm indexing some non-english content. But the page also contains english content. As of now I'm using WhitespaceAnalyzer for all content and I'm storing the full webpage content under a single field. Now we need to support case folding and stemming for the english content intermingled