Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-03 Thread KK
Muir, thanks for your response. I'm indexing Indian-language web pages which have a decent amount of English content mixed in. For the time being I'm not going to use any stemmers, as we don't have standard stemmers for Indian languages. So what I want to do is this: say I've a web

Extending StandardAnalyzer considered harmful

2009-06-03 Thread Daniel Noll
Hi all. I just want to tell some people an interesting story. :-) We had a custom analyser which was implemented like this: public class NoStopWordsAnalyser extends StandardAnalyzer { public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream result = ne

Re: Phrase Highlighting

2009-06-03 Thread Max Lynch
On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller wrote: > Max Lynch wrote: >> Well what happens is if I use a SpanScorer instead, and allocate it like such: analyzer = StandardAnalyzer([]) tokenStream = analyzer.tokenStream("contents", luc

Re: Phrase Highlighting

2009-06-03 Thread Mark Miller
Max Lynch wrote: Well what happens is if I use a SpanScorer instead, and allocate it like such: analyzer = StandardAnalyzer([]) tokenStream = analyzer.tokenStream("contents", lucene.StringReader(text)) ctokenStream = lucene.CachingTokenFilter(tokenStre

Re: Lucene Website Integration

2009-06-03 Thread Gary Moore
I would suggest you take a look at Solr -- http://lucene.apache.org/solr -- which requires essentially no Java knowledge to use. It has a Python client which at the very least might help with the learning curve. If you want to try an alternative to JSP/Servlets for your web framework, there'

Lucene Website Integration

2009-06-03 Thread listan...@gmail.com
Hi all, I need to develop a website that allows for searching and browsing the underlying document collection. I am going to be using Lucene as the underlying search engine. I am, however, not very familiar with web development, and am new to Lucene as well. I have used JSP/Servlets before, and sin

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bradford Stephens
Sorry, no videos this time. The conversation wasn't very structured... next month I'll record it :) On Wed, Jun 3, 2009 at 1:59 PM, Bhupesh Bansal wrote: > Great Bradford, > > Can you post some videos if you have some ? > > Best > Bhupesh > > > > On 6/3/09 11:58 AM, "Bradford Stephens" > wrote:

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bradford Stephens
Hey everyone! I just wanted to give a BIG THANKS to everyone who came. We had over a dozen people, and a few got lost at UW :) [I would have sent this update earlier, but I flew to Florida the day after the meeting]. If you didn't come, you missed quite a bit of learning and topics. Such as: -B

Re: How to support stemming and case folding for english content mixed with non-english content?

2009-06-03 Thread Robert Muir
KK, is all of your Latin-script text actually English? Is there stuff like German or French mixed in? And for your non-English content (your examples have been Indian writing systems), is it generally true that if you have Devanagari, you can assume it's Hindi? Or is there stuff like Marathi mixed i

How to support stemming and case folding for english content mixed with non-english content?

2009-06-03 Thread KK
Hi All, I'm indexing some non-English content. But the page also contains English content. As of now I'm using WhitespaceAnalyzer for all content and I'm storing the full webpage content under a single field. Now we need to support case folding and stemming for the English content intermingled
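The requirement in this message (fold case for the English words while leaving the other scripts untouched) can be sketched in a few lines of plain Python. This is only an illustration, not Lucene code and not the approach the thread settled on: fold_english is a hypothetical helper, tokens are split on whitespace as WhitespaceAnalyzer would, and treating any all-ASCII token containing a letter as "English" is an assumption that would also catch other Latin-script languages written without accents.

```python
def fold_english(text):
    """Split on whitespace and lowercase only the tokens that look English.

    Assumption: a token counts as English if it is pure ASCII and contains
    at least one letter. Every other token is passed through unchanged.
    """
    folded = []
    for token in text.split():
        if token.isascii() and any(ch.isalpha() for ch in token):
            # An English stemmer could also be applied at this point.
            folded.append(token.lower())
        else:
            folded.append(token)
    return folded
```

For example, fold_english("Hello दुनिया WORLD") lowercases the two ASCII tokens and leaves the Devanagari token as it was.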

Re: Index and search terms containing character "-"

2009-06-03 Thread Erick Erickson
Just be aware that KeywordAnalyzer won't tokenize at all. That is, if you expect to index "jack-bauer" and hit on "jack" or "bauer", it won't. Best Erick On Wed, Jun 3, 2009 at 2:25 AM, legrand thomas wrote: > Hi, > > A KeywordAnalyzer solved my problem. > Luke allowed me to understand the queries
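Erick's caveat can be illustrated without Lucene. The Python sketch below calls no Lucene API; keyword_tokens and split_tokens are hypothetical helpers that only mimic the two behaviours being contrasted: keeping the whole field value as one term, versus breaking it at non-alphanumeric characters such as "-".

```python
import re


def keyword_tokens(text):
    """Mimic a keyword-style analyzer: the whole input is a single term."""
    return [text]


def split_tokens(text):
    """Mimic a tokenizing analyzer: break on any non-alphanumeric character."""
    return [part for part in re.split(r"[^0-9A-Za-z]+", text) if part]


def matches(term, indexed_terms):
    """An exact term query hits only if the term itself was indexed."""
    return term in indexed_terms
```

With the keyword-style terms, matches("jack-bauer", ...) succeeds but matches("jack", ...) does not; with the split terms it is the other way round, which is the trade-off described in the message.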