Re: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread Robert Muir
ok, well at first i thought you must be playing a joke on me or something... Maybe you want to create a lucene analyzer that mimics solr defaults. Search the mail archives for this recent thread, where KK posted his code: Re: How to support stemming and case folding for english content mixed with

RE: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread OBender Hotmail
That's the thing: there is no actual requirement. I've been presented with all the languages that the company theoretically provides. My guess is that what I'm going to end up with is all western languages, a good share of the Arabic family, a complete set of Eastern and Eastern European ones and of course CJ

Re: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread Robert Muir
Really, you have a requirement that the system should search written Cornish? I think you might have larger problems! On Mon, Jun 15, 2009 at 9:18 PM, OBender Hotmail wrote: > Here is the list of possible languages. Don't laugh :) I know those are > almost all world languages but it is a true re

RE: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread OBender Hotmail
Here is the list of possible languages. Don't laugh :) I know those are almost all world languages but it is a true requirement. Well, the actual number will be closer to 70, not 100, but still I don't really know which ones from the list below will end up in the DB. --- Afrikaans Albanian Arabi

Re: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread Robert Muir
it's not too bad, here would be a simple one that only breaks words on whitespace and lowercases: public class Example extends Analyzer { public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream ts = new WhitespaceTokenizer(reader); ts = new LowerCaseFilter(ts); retur
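
To make concrete what a chain that "only breaks words on whitespace and lowercases" would emit, here is a library-free sketch in plain Java. It deliberately does not use Lucene's classes: `WhitespaceLowercaseSketch` and `tokens` are hypothetical names chosen for illustration, and splitting on `Character.isWhitespace` followed by locale-independent lowercasing is an assumption about what such a chain amounts to, not a copy of Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Lucene.
public class WhitespaceLowercaseSketch {

    /** Splits text on whitespace and lowercases each resulting token. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // A whitespace character closes the token in progress, if any.
                if (current.length() > 0) {
                    out.add(current.toString().toLowerCase(Locale.ROOT));
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the final token when the text does not end in whitespace.
        if (current.length() > 0) {
            out.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Lucene  AND Solr"));
    }
}
```

Note that text written without spaces between words (Chinese or Japanese, for example) would come out of this as one long token per run of characters, so whitespace splitting on its own would probably not be enough for the CJK part of the language list.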

RE: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread OBender Hotmail
I've looked over SolR quickly, it is a bit too heavy for my project. So what is required (at a minimum) to build an analyzer? The sandbox has a few of them, varying in complexity. -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Monday, June 15, 2009 4:51 PM To: java-user@

Re: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread Robert Muir
Well just reply back if SolR is inappropriate for your needs. In that case, you will need to build a custom analyzer (it's not too bad), so that you can use compass. On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail wrote: > Hi, > > My goal is to find a framework that encapsulates as much low level

RE: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread OBender Hotmail
Hi, My goal is to find a framework that encapsulates as much low level indexing/search technology as possible and have it integrate nicely with Spring. It looked like Compass was/is a good encapsulation of the functionality. I'll take a look at SolR though, thanks for the pointer. -Origina

lucene highlighter snowball analyzer highlight stem words

2009-06-15 Thread faisalloe
I have the following method to highlight the terms. public static String getHighlighter(String colName, Highlighter highlighter,IndexSearcher searcher,int id, Analyzer analyzer) throws IOException { String highlightTerm; TokenStream tokenStream;

Re: Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread Robert Muir
Hi, (Since this is an issue you brought up on the Compass forums) I wonder what stage of the development process you are at? Have you considered SolR, or does compass provide some other functionality that you need? The reason I say this is that the easiest solution might be to use a nightly Sol

Lucene and multi-lingual Unicode - advice needed

2009-06-15 Thread OBender Hotmail
Hi All! I'm new to Lucene so forgive me if this question was asked before. I have a database with records in the same table in many different languages (up to 70). It includes all W-European, Arabic, Eastern, CJK, Cyrillic, etc., you name it. I've looked at what people say about Lucene and it l

Re: Fuzzy vs Prefix query Performance

2009-06-15 Thread mark harwood
FuzzyQuery performance is related to the number of unique terms in the index, not the number of documents, e.g. a single "telephone directory" document could contain millions of terms. Each term considered is compared using an "edit distance" algorithm, which is CPU intensive. The FuzzyQuery prefix length
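
The cost described above can be sketched without Lucene at all. The following is a minimal illustration, assuming a plain Levenshtein distance and a simple "must share the first N characters" rule; `FuzzyScanSketch`, `editDistance`, `fuzzyMatches` and the `maxEdits` threshold are hypothetical names for this example and are not Lucene's API or its exact similarity formula. The point it tries to show is that every candidate term pays for an edit-distance computation unless a required prefix rules it out first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's FuzzyQuery implementation.
public class FuzzyScanSketch {

    /** Levenshtein distance using two rolling rows of the DP table. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty string to b[0..j)
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance from a[0..i) to the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Scans every term; terms that do not share the first prefixLength
     * characters of the query are skipped before any distance is computed.
     */
    static List<String> fuzzyMatches(List<String> terms, String query,
                                     int prefixLength, int maxEdits) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                continue; // cheap rejection, no edit-distance work
            }
            if (editDistance(term, query) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With a prefix length of 0 the loop computes a distance for every unique term, which is why the work grows with the size of the term dictionary rather than with the document count. A non-zero prefix trades recall (a typo in the first characters is no longer matched) for far fewer comparisons.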

Re: Fuzzy vs Prefix query Performance

2009-06-15 Thread Zsolt Koppany
Erick, this is a web application running 24 hours a day, so caching cannot be the reason. I get the same result after I re-start the same search. Zsolt Erick Erickson wrote: Well, if you're seeing it, it's possible. But the first question is always "what were you measuring?" Be aware that

Re: Fuzzy vs Prefix query Performance

2009-06-15 Thread Erick Erickson
Well, if you're seeing it, it's possible. But the first question is always "what were you measuring?" Be aware that when you open a searcher, the first few queries can fill caches, etc. and may take an anomalously long time, especially if you're sorting. So could you give more details of your t
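
One way to keep cache-filling first queries out of a measurement is to run the operation a few times untimed before timing it. The sketch below assumes the search call can be wrapped in a `Runnable`; `WarmTimingSketch`, `medianNanos` and the run counts are hypothetical names and values for illustration, not anything prescribed in the thread.

```java
import java.util.Arrays;

// Hypothetical timing helper; the task would wrap the actual search call.
public class WarmTimingSketch {

    /**
     * Runs the task warmupRuns times without timing it, then times
     * measuredRuns further executions and returns the middle sample
     * (the upper median when measuredRuns is even), in nanoseconds.
     */
    static long medianNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be at least 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // results discarded: these runs only warm caches
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[measuredRuns / 2];
    }
}
```

Reporting a median rather than a single run makes one unusually slow execution less likely to dominate the number being compared.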

Fuzzy vs Prefix query Performance

2009-06-15 Thread Zsolt Koppany
Hi, on 99470 documents (I mean Lucene documents) a FuzzyQuery needs approx 30 seconds but a PrefixQuery less than one. All Lucene files need 65MB together. I'm a bit surprised by that. Is that possible? Zsolt Zsolt Koppany Phone: +49-711-67400-679 --

Re: London Open Source Search meetup - Mon 15th June

2009-06-15 Thread Richard Marr
Thanks Joel, good point. We'll definitely be there by 7pm but may be a little earlier if the will to continue working is elusive. 2009/6/14 Joel Halbert : > Hi Rich - from what time? > > > -Original Message- > From: Richard Marr > Reply-To: java-user@lucene.apache.org > To: java-user@l

Re: Using lucene in a clustered app server

2009-06-15 Thread Tarandeep Singh
On Mon, Jun 15, 2009 at 1:04 AM, Amin Mohammed-Coleman wrote: > Hi > > I'm looking at Hadoop and Katta and I was wondering if some may be able > clarify the following: > > 1) Is Katta replacing the Hadoop Lucene contribution You mean the index package in Hadoop's contrib folder? So far what I ha

Re: London Open Source Search meetup - Mon 15th June

2009-06-15 Thread Joel Halbert
Hi Rich - from what time? -Original Message- From: Richard Marr Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: London Open Source Search meetup - Mon 15th June Date: Fri, 12 Jun 2009 12:54:30 +0100 Hi all, Just a quick reminder that this is happening

Re: Using lucene in a clustered app server

2009-06-15 Thread Amin Mohammed-Coleman
Hi I'm looking at Hadoop and Katta and I was wondering if someone may be able to clarify the following: 1) Is Katta replacing the Hadoop Lucene contribution? 2) Are people still using Hadoop Lucene to perform indexing? Cheers Amin On Sat, Jun 13, 2009 at 7:46 AM, Amin Mohammed-Coleman wrote: > Hi > T