Hi,
During my search for an alternative to StandardAnalyzer, I found some useful
information about the JFlex-based FastAnalyzer in this user group. I tried to
get the corresponding files from
https://issues.apache.org/jira/browse/LUCENE-966 . But they are in txt
format, and how can I get and test that impr
Hi,
I am using a Nutch index to search in Lucene. One of my classes uses the
makeStopTable method (which is deprecated) of class StopFilter in
org.apache.lucene.analysis. When I run my program with Lucene 2.1.0
~/j2sdk1.4.2/bin/java -classpath .:lucene-core-2.1.0.jar SearchFiles
Exception in th
Greetings All,
I have been trying out Lucene recently and am very happy with the search
performance. But I just noticed that when Lucene performs a search or indexes,
the CPU usage on my machine rises to 100%; because of this, some of my
other backend processes eventually slow down. Just want
Yes, the character set we use is, as I remember,
MARC-8. Which I don't think is the ISOLatin,
but since I didn't know about that filter when we had our problem,
I didn't even look. Oh well, smarter/braver/lazier next time ...
Which is why I love this list, I find things like this and loo
SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder)
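A minimal sketch of using it for the MAC-address case quoted below (the field
name and pre-split tokens are my own illustration, not from the thread):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Hypothetical helper: require the tokens adjacent and in order.
    public static SpanQuery exactSpan(String field, String[] tokens) {
        SpanQuery[] clauses = new SpanQuery[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            clauses[i] = new SpanTermQuery(new Term(field, tokens[i]));
        }
        return new SpanNearQuery(clauses, 0, true); // slop 0, strictly in order
    }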
Erick
On 7/30/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> What about the case where I want to search a MAC address? For example,
> 00:14:da:81:21:4f will be split by the StandardTokenizer as the tokens
> "00", "14", "da", "81", "
not that I know of
Erick
On 7/30/07, Max Metral <[EMAIL PROTECTED]> wrote:
>
> I have a set of tags associated with content in my corpus. I also have
> normal text. Our system tries to figure out which "words" are tags and
> which are text, and falls back on text when tags fail. I'm wonder
Hi Tim!
On Jul 25, 2007, at 8:41 PM, Tim Sturge wrote:
I am indexing a set of constantly changing documents. The change
rate is moderate (about 10 docs/sec over a 10M document collection
with a 6G total size) but I want to be right up to date (ideally
within a second but within 5 seconds
What about the case where I want to search a MAC address? For example,
00:14:da:81:21:4f will be split by the StandardTokenizer as the tokens
"00", "14", "da", "81", "21", and "4f".
Suppose I want to search for 00:14:da:81:21:4f. In the search box, I type
00:14:da:81:21:4f. But because these are a
Being a French speaker, I will mention the following special cases:
- "plus ça change" -> "plus ca change"
- "œuf" -> "oeuf"
- "lætitia" -> "laetitia"
But I just looked, and it looks like ISOLatin1AccentFilter handles these.
Better test to be sure...
--Renaud
-Original Message-
From:
Oh, yeah, I know now :-). But I really do have a requirement to show
search results from items that came in 5 seconds ago. We have an
application where a common usage pattern is
add an item
navigate to another item
search for the first item (to associate it with the second item)
and the gap b
Hi, Erick,
I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip,
and it works great! And I think it's the right way to go. Problems
like "You have to store the data raw for display purposes if you want
the accents to show though" will go away, since the Analyzer already has
the origin
Hi, Samir,
Thanks a lot for this tip! It works great!
--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Crea
And by the way, I cannot see it ever making sense to keep reopening an index
reader every second or so. It has to be MUCH more efficient to wait even
2 or 4 seconds... even that is going to be pretty nasty, but you have
to allow for a bit of batching, man. You will waste so much time opening those
I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.
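In the meantime, a minimal sketch of the batched full-reopen approach
(Lucene 2.x API; the swap is simplified and ignores in-flight searches):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public static void refreshLoop(String indexDir) throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        while (true) {
            Thread.sleep(2000); // batch: refresh every 2 seconds, not per update
            IndexReader fresh = IndexReader.open(indexDir);
            IndexReader stale = reader;
            reader = fresh;
            searcher = new IndexSearcher(reader); // new queries see the new view
            stale.close();
        }
    }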
On 7/30/07, Tim Sturge <[EMAIL PROTECTED]> wrote:
>
> Thanks for the reply Erick,
>
> I believe it is the gc for four reasons:
>
> - I've tried the "warmup" approach already a
Thanks for the reply Erick,
I believe it is the gc for four reasons:
- I've tried the "warmup" approach already and it didn't change the
situation.
- The server completely pauses for several seconds. I run jstack to find
out where the pause is, and it also pauses for several seconds before
t
I have a set of tags associated with content in my corpus. I also have
normal text. Our system tries to figure out which "words" are tags and
which are text, and falls back on text when tags fail. I'm wondering,
is there anything in Lucene which might help disambiguate multi-word
tags from text?
Hi,
Take a look at the class ISOLatin1AccentFilter! Add it to your analyzer
and it should work!
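A minimal sketch of wiring it in (assuming Lucene 2.2, where the filter lives
in org.apache.lucene.analysis; the wrapper class name here is made up):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Strips accents from whatever StandardAnalyzer produces.
    public class AccentFoldingAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new ISOLatin1AccentFilter(delegate.tokenStream(fieldName, reader));
        }
    }

Use the same analyzer at index and search time, or "fenêtre" and "fenetre"
will never meet in the index.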
Hope this will help,
Samir
-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Monday, 30 July 2007 20:06
To: java-user@lucene.apache.org
Subject: a question for frenc
Gosh, I sure hope not, because that would mean that we rolled our
own for no good reason. We wound up just collapsing
the input stream by substituting plain old 'e' for all the accented
variants before indexing and before searching. Be *really* careful
what character set you're using.
Actually, we
Hi,
I am not a French speaker, but here are some questions regarding
French analyzer:
Is there any analyzer that can do this: convert accented letters to their
non-accented counterparts (é, è, ê, ë -> e), so that a search for
"fenêtre" (= window) finds all docs with "fenêtre" or "fenetre",
and sea
>
> So then would I just concatenate the tokens together to form
> the query text?
You might do better to create a TermQuery for each token instead of
concatenating, combine them in a BooleanQuery, and say whether all terms must
or should occur. Very simple, see [1] and the sketch below.
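Something like this (a sketch; the field and tokens are whatever your
analyzer produced):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // One TermQuery per token, combined with MUST (all) or SHOULD (any).
    public static BooleanQuery fromTokens(String field, String[] tokens,
                                          boolean requireAll) {
        BooleanQuery query = new BooleanQuery();
        BooleanClause.Occur occur =
            requireAll ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
        for (int i = 0; i < tokens.length; i++) {
            query.add(new TermQuery(new Term(field, tokens[i])), occur);
        }
        return query;
    }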
Regards Ard
[1]
http://luce
So then would I just concatenate the tokens together to form the query text?
--
Joe Attardi
[EMAIL PROTECTED]
http://thinksincode.blogspot.com/
On 7/30/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> Would this work?
>
> TokenStream ts = StandardAnalyzer.tokenStream();
> while ((Token tok = t
Would this work?

Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("myfield", new StringReader(text));
Token tok;
while ((tok = ts.next()) != null) {
    // do whatever
}
Best
Erick
On 7/30/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> Following up on my recent question. It has been suggested to me that I can
> run the query text through
Yeah, it's a surprise, isn't it? I'm afraid there isn't a good answer.
http://wiki.apache.org/lucene-java/BooleanQuerySyntax
The "best practice" appears to be to require parens everywhere to force the
evaluation order. Not very satisfying, but it does work 100%.
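For instance, for the two queries later in this digest, one explicit grouping
(my guess at the intent) would be:

    (ABST:"spring-elastic"^3 AND SPEC:"internal combustion"^2) OR ABST:"cylinder"^3

With the parens in place, the two clause orderings should give the same
results.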
Check this out:
http://www.gossamer-threads.com/lists/lucene/java-user/35433?search_string=category;#35433
On 7/30/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> We found that a fast way to do this simply by running a query for each
> category and getting the maxDocs. There would be one query
Hello,
> I have two questions.
>
> First, Is there a tokenizer that takes every word and simply
> makes a token
> out of it?
org.apache.lucene.analysis.WhitespaceTokenizer
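A quick illustration (Lucene 2.x; the sample text is mine):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Splits on whitespace only; punctuation stays inside the tokens.
    TokenStream ts = new WhitespaceTokenizer(new StringReader("foo 00:14:da bar"));
    Token tok;
    while ((tok = ts.next()) != null) {
        System.out.println(tok.termText()); // foo | 00:14:da | bar
    }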
> So it looks for two white spaces and takes the characters
> between them and makes a token out of them?
>
> If this to
I have two questions.
First, is there a tokenizer that takes every word and simply makes a token
out of it? So it looks for two white spaces and takes the characters
between them and makes a token out of them?
If this tokenizer exists, is there a difference between doing that and
simply storing
Hi,
I am getting different results for the following queries.
1. ABST:"spring-elastic"^3 AND SPEC:"internal combustion"^2 OR
ABST:"cylinder"^3
2. SPEC:"internal combustion"^2 AND ABST:"spring-elastic"^3 OR
ABST:"cylinder"^3
I think the above two queries are similar and will give the sa
We found that a fast way to do this is simply to run a query for each
category and get the maxDocs. There would be one query per category,
each getting a single hit count.
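Roughly (Lucene 2.x; the path and field name are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    // One query per category; the hit count is the per-category total.
    Hits hits = searcher.search(new TermQuery(new Term("category", "books")));
    int count = hits.length();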
Dennis Kubes
Erick Erickson wrote:
You might want to search the mail archive for "facets" or "faceted search"
(no quotes), as I
> > It does sound very strange to me, to default to a
> WildCardQuery! Suppose I
> > am looking for "bold", I am getting hits for "old".
>
> I know - but that's what the requirements dictate. A better
> example might be
> a MAC or IP address, where someone might be searching for a
> string in
Or check out Solr and see if you can use that, or see how they do it,
Regards Ard
>
> You might want to search the mail archive for "facets" or
> "faceted search"
> (no quotes), as I *think* this might be relevant.
>
> Best
> Erick
>
> On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> >
>
Following up on my recent question. It has been suggested to me that I can
run the query text through an Analyzer without using the QueryParser. For
example, if I know what field to be searched I can create a PrefixQuery or
WildcardQuery, but still want to process the search text with the same
Anal
Hey Jeff, I didn't have any luck. I don't think your approach is going to
help me, but thanks for the reply. I'll try a solution that avoids
this kind of problem.
[]s
Rossini
On 7/29/07, Jeff French <[EMAIL PROTECTED]> wrote:
>
>
> Rossini, have you had any luck with this? I don't kno
>
> It does sound very strange to me, to default to a WildCardQuery! Suppose I
> am looking for "bold", I am getting hits for "old".
I know - but that's what the requirements dictate. A better example might be
a MAC or IP address, where someone might be searching for a string in the
middle - like,
See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default
max field length, last I knew, was 10,000. But this sounds like
it might relate to your issue.
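If truncation is the problem, something like this (a sketch; the path and
analyzer are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    // The default silently stops indexing a field after 10,000 terms.
    writer.setMaxFieldLength(Integer.MAX_VALUE);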
Best
Erick
On 7/27/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote:
>
> Hi guys,
>
> I would like to know if exist some limit of size
I've built a production index with this patch and done some query stress
testing with no problems.
I'd give it a thumbs up.
Peter
On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
>
>
> Hi guys,
>
> Do you think LUCENE-843 is stable enough? If so, do you think it's worth
> to
> release it with probabl
You might want to search the mail archive for "facets" or "faceted search"
(no quotes), as I *think* this might be relevant.
Best
Erick
On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
>
> Hi ,
> Of course this statement is very expensive.
> -->document.get("CAMPCATID")==null?"":document.get("
Hello,
> Hi everyone,
>
> I told you I'd be back with more questions! :-)
> Here is my situation. In my application, the field to be searched is
> selected via a drop-down box. I want my searches to basically
> be "contains"
> searches - I take what the user typed in, put a wildcard
> characte
Hi everyone,
I told you I'd be back with more questions! :-)
Here is my situation. In my application, the field to be searched is
selected via a drop-down box. I want my searches to basically be "contains"
searches - I take what the user typed in, put a wildcard character at the
beginning and end
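For reference, doing that directly (rather than through QueryParser, which
rejects a leading wildcard by default) might look like this sketch; the field
name and input are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // "contains" search: wildcards on both sides of the user's text.
    // A leading wildcard is expensive -- it has to enumerate many terms.
    Query q = new WildcardQuery(new Term("name", "*" + userText + "*"));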
Hi guys,
Do you think LUCENE-843 is stable enough? If so, do you think it's worth
releasing it, probably with Lucene 2.2.1? It would be nice so that people can
take advantage of it right away without risking other breaking changes
in the HEAD branch or waiting until the 2.3 release.
Thanks,
--
On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
There have also been a bunch of near-duplicate ideas that have been
presented on the forums before.
This is one of t
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
-Grant
On Jul 29, 2007, at 2:18 AM, Dmitry wrote:
We are trying to find any existing implementation for Lucene for detecting
duplicates in an index.
Assuming we have a set of doc
Hey Lukas,
I was being simplistic when I said that the text and TokenStream must be
exactly the same. It's difficult to think of a reason why you would not
want them to be the same though. Each Token records the offsets where it
can be found in the original text -- that is how the Highlighter k
: Where shall I post this issue.
you are currently posting to a list named "java-user" this is for "user"
related questions about the "java" lucene project.
if you have questions about "Lucene.Net" you should be asking them on the
"Lucene.Net" user list...
http://incubator.apache.org/lucene.net
A couple of thoughts here...
You could hash (e.g. MD5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in each hash bucket as
the non-dup document and then delete the other dups. This could be run as a
batch job to eliminate the duplicates in an off-line p
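A rough sketch of the hashing step (plain JDK, 1.4-style; fetching document
text and actually deleting the dups depends on your setup):

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    // Maps each content hash to the first doc seen; later ones are dups.
    public static Map findDuplicates(String[] contents) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Map firstSeen = new HashMap(); // hex digest -> Integer doc number
        for (int i = 0; i < contents.length; i++) {
            byte[] digest = md5.digest(contents[i].getBytes("UTF-8"));
            StringBuffer hex = new StringBuffer();
            for (int j = 0; j < digest.length; j++) {
                hex.append(Integer.toHexString((digest[j] & 0xff) | 0x100).substring(1));
            }
            Integer first = (Integer) firstSeen.get(hex.toString());
            if (first != null) {
                System.out.println("doc " + i + " duplicates doc " + first);
            } else {
                firstSeen.put(hex.toString(), new Integer(i));
            }
        }
        return firstSeen;
    }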