Hi,
During my search for an alternative to StandardAnalyzer, I found some useful
information about the JFlex-based FastAnalyzer in this user group. I tried to
get the corresponding files from
https://issues.apache.org/jira/browse/LUCENE-966 . But they are in txt
format, and how can I get and test that impr
Hi,
I am using a Nutch index to search in Lucene. One of my classes uses the
makeStopTable method (which is deprecated) of class StopFilter in
org.apache.lucene.analysis. When I run my program with Lucene 2.1.0
~/j2sdk1.4.2/bin/java -classpath .:lucene-core-2.1.0.jar SearchFiles
Exception in th
Greetings All,
I have been trying out Lucene recently and am very happy with the search
performance. But I just noticed that when Lucene performs a search or indexes,
the CPU usage on my machine rises to 100%; because of this, some of my
other backend processes eventually slow down. Just want
Yes, the character set we use is, as I remember,
MARC-8. Which I don't think is the ISOLatin,
but since I didn't know about that filter when we had our problem,
I didn't even look. Oh well, smarter/braver/lazier next time ...
Which is why I love this list, I find things like this and loo
SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder)
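A minimal sketch of using it for the MAC-address case quoted below (the field
name and pre-split tokens are my own illustration, not from the thread):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Hypothetical helper: require the tokens adjacent and in order.
    public static SpanQuery exactSpan(String field, String[] tokens) {
        SpanQuery[] clauses = new SpanQuery[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            clauses[i] = new SpanTermQuery(new Term(field, tokens[i]));
        }
        return new SpanNearQuery(clauses, 0, true); // slop 0, strictly in order
    }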
Erick
On 7/30/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> What about the case where I want to search a MAC address? For example,
> 00:14:da:81:21:4f will be split by the StandardTokenizer as the tokens
> "00", "14", "da", "81", "
not that I know of
Erick
On 7/30/07, Max Metral <[EMAIL PROTECTED]> wrote:
>
> I have a set of tags associated with content in my corpus. I also have
> normal text. Our system tries to figure out which "words" are tags and
> which are text, and falls back on text when tags fail. I'm wonder
Hi Tim!
On Jul 25, 2007, at 8:41 PM, Tim Sturge wrote:
I am indexing a set of constantly changing documents. The change
rate is moderate (about 10 docs/sec over a 10M document collection
with a 6G total size) but I want to be right up to date (ideally
within a second but within 5 seconds
What about the case where I want to search a MAC address? For example,
00:14:da:81:21:4f will be split by the StandardTokenizer as the tokens
"00", "14", "da", "81", "21", and "4f".
Suppose I want to search for 00:14:da:81:21:4f. In the search box, I type
00:14:da:81:21:4f. But because these are a
Being a French speaker, I will mention the following special cases:
- "plus ça change" -> "plus ca change"
- "œuf" -> "oeuf"
- "lætitia" -> "laetitia"
But I just looked, and it looks like ISOLatin1AccentFilter handles these.
Better test to be sure...
--Renaud
-Original Message-
From:
Oh, yeah, I know now :-). But I really do have a requirement to show
search results from items that came in 5 seconds ago. We have an
application where a common usage pattern is
add an item
navigate to another item
search for the first item (to associate it with the second item)
and the gap b
Hi, Erick,
I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip,
and it works great! And I think it's the right way to go. Problems
like "You have to store the data raw for display purposes if you want
the accents to show though" will go away, since the Analyzer already has
the origin
Hi, Samir,
Thanks a lot for this tip! It works great!
--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Crea
And by the way, I cannot see it ever making sense to keep reopening an index
reader every second or so. It has to be MUCH more efficient to wait even
2 or 4 seconds... even that is going to be pretty nasty, but you have
to allow for a bit of batching, man. You will waste so much time opening those
I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.
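In the meantime, a minimal sketch of the batched full-reopen approach
(Lucene 2.x API; the swap is simplified and ignores in-flight searches):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public static void refreshLoop(String indexDir) throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        while (true) {
            Thread.sleep(2000); // batch: refresh every 2 seconds, not per update
            IndexReader fresh = IndexReader.open(indexDir);
            IndexReader stale = reader;
            reader = fresh;
            searcher = new IndexSearcher(reader); // new queries see the new view
            stale.close();
        }
    }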
On 7/30/07, Tim Sturge <[EMAIL PROTECTED]> wrote:
>
> Thanks for the reply Erick,
>
> I believe it is the gc for four reasons:
>
> - I've tried the "warmup" approach already a
Thanks for the reply Erick,
I believe it is the gc for four reasons:
- I've tried the "warmup" approach already and it didn't change the
situation.
- The server completely pauses for several seconds. I run jstack to find
out where the pause is, and it also pauses for several seconds before
t
I have a set of tags associated with content in my corpus. I also have
normal text. Our system tries to figure out which "words" are tags and
which are text, and falls back on text when tags fail. I'm wondering,
is there anything in Lucene which might help disambiguate multi-word
tags from text?
Hi,
Take a look at the class ISOLatin1AccentFilter! Add it to your analyzer
and it should work!
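A minimal sketch of wiring it in (assuming Lucene 2.2, where the filter lives
in org.apache.lucene.analysis; the wrapper class name here is made up):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Strips accents from whatever StandardAnalyzer produces.
    public class AccentFoldingAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new ISOLatin1AccentFilter(delegate.tokenStream(fieldName, reader));
        }
    }

Use the same analyzer at index and search time, or "fenêtre" and "fenetre"
will never meet in the index.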
Hope this will help,
Samir
-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Monday, 30 July 2007 20:06
To: java-user@lucene.apache.org
Subject: a question for frenc
Gosh, I sure hope not, because that would mean that we rolled our
own for no good reason. We wound up just collapsing
the input stream by substituting plain old 'e' for all the accented
variants before indexing and before searching. Be *really* careful
what character set you're using.
Actually, we
Hi,
I am not a French speaker, but here are some questions regarding
French analyzer:
Is there any analyzer that can do this: convert accented letters to their
non-accented counterparts (é, è, ê, ë -> e), so that a search for
"fenêtre" (= window) finds all docs with "fenêtre" or "fenetre",
and sea
>
> So then would I just concatenate the tokens together to form
> the query text?
You might do better to create a TermQuery for each token instead of
concatenating, combine them in a BooleanQuery, and say whether all terms must
or should occur. Very simple, see [1] and the sketch below.
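Something like this (a sketch; the field and tokens are whatever your
analyzer produced):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // One TermQuery per token, combined with MUST (all) or SHOULD (any).
    public static BooleanQuery fromTokens(String field, String[] tokens,
                                          boolean requireAll) {
        BooleanQuery query = new BooleanQuery();
        BooleanClause.Occur occur =
            requireAll ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
        for (int i = 0; i < tokens.length; i++) {
            query.add(new TermQuery(new Term(field, tokens[i])), occur);
        }
        return query;
    }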
Regards Ard
[1]
http://luce
So then would I just concatenate the tokens together to form the query text?
--
Joe Attardi
[EMAIL PROTECTED]
http://thinksincode.blogspot.com/
On 7/30/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> Would this work?
>
> TokenStream ts = StandardAnalyzer.tokenStream();
> while ((Token tok = t
Would this work?

Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("myfield", new StringReader(text));
Token tok;
while ((tok = ts.next()) != null) {
    // do whatever
}
Best
Erick
On 7/30/07, Joe Attardi <[EMAIL PROTECTED]> wrote:
>
> Following up on my recent question. It has been suggested to me that I can
> run the query text through
Yeah, it's a surprise, isn't it? I'm afraid there isn't a good answer.
http://wiki.apache.org/lucene-java/BooleanQuerySyntax
The "best practice" appears to be to require parens everywhere to force the
evaluation order. Not very satisfying, but it does work 100%.
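For instance, for the two queries later in this digest, one explicit grouping
(my guess at the intent) would be:

    (ABST:"spring-elastic"^3 AND SPEC:"internal combustion"^2) OR ABST:"cylinder"^3

With the parens in place, the two clause orderings should give the same
results.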
Check this out:
http://www.gossamer-threads.com/lists/lucene/java-user/35433?search_string=category;#35433
On 7/30/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> We found that a fast way to do this simply by running a query for each
> category and getting the maxDocs. There would be one query
Hello,
> I have two questions.
>
> First, Is there a tokenizer that takes every word and simply
> makes a token
> out of it?
org.apache.lucene.analysis.WhitespaceTokenizer
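A quick illustration (Lucene 2.x; the sample text is mine):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Splits on whitespace only; punctuation stays inside the tokens.
    TokenStream ts = new WhitespaceTokenizer(new StringReader("foo 00:14:da bar"));
    Token tok;
    while ((tok = ts.next()) != null) {
        System.out.println(tok.termText()); // foo | 00:14:da | bar
    }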
> So it looks for two white spaces and takes the characters
> between them and makes a token out of them?
>
> If this to
I have two questions.
First, is there a tokenizer that takes every word and simply makes a token
out of it? So it looks for two white spaces and takes the characters
between them and makes a token out of them?
If this tokenizer exists, is there a difference between doing that and
simply storing
Hi,
I am getting different results for the following queries.
1. ABST:"spring-elastic"^3 AND SPEC:"internal combustion"^2 OR
ABST:"cylinder"^3
2. SPEC:"internal combustion"^2 AND ABST:"spring-elastic"^3 OR
ABST:"cylinder"^3
I think the above two queries are similar and will give the sa
We found that a fast way to do this is simply to run a query for each
category and get the maxDocs. There would be one query per category,
each getting a single hit count.
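Roughly (Lucene 2.x; the path and field name are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    // One query per category; the hit count is the per-category total.
    Hits hits = searcher.search(new TermQuery(new Term("category", "books")));
    int count = hits.length();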
Dennis Kubes
Erick Erickson wrote:
You might want to search the mail archive for "facets" or "faceted search"
(no quotes), as I
> > It does sound very strange to me, to default to a
> WildCardQuery! Suppose I
> > am looking for "bold", I am getting hits for "old".
>
> I know - but that's what the requirements dictate. A better
> example might be
> a MAC or IP address, where someone might be searching for a
> string in
Or check out Solr and see if you can use that, or see how they do it,
Regards Ard
>
> You might want to search the mail archive for "facets" or
> "faceted search"
> (no quotes), as I *think* this might be relevant.
>
> Best
> Erick
>
> On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> >
>
Following up on my recent question. It has been suggested to me that I can
run the query text through an Analyzer without using the QueryParser. For
example, if I know what field to be searched I can create a PrefixQuery or
WildcardQuery, but still want to process the search text with the same
Anal
Hey Jeff, I didn't have any luck. I don't think your approach is going to
help me, but thanks for the reply. I'll try a solution that avoids
this kind of problem.
[]s
Rossini
On 7/29/07, Jeff French <[EMAIL PROTECTED]> wrote:
>
>
> Rossini, have you had any luck with this? I don't kno
>
> It does sound very strange to me, to default to a WildCardQuery! Suppose I
> am looking for "bold", I am getting hits for "old".
I know - but that's what the requirements dictate. A better example might be
a MAC or IP address, where someone might be searching for a string in the
middle - like,
See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default
max field length, last I knew, was 10,000. But this sounds like
it might relate to your issue.
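If truncation is the problem, something like this (a sketch; the path and
analyzer are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    // The default silently stops indexing a field after 10,000 terms.
    writer.setMaxFieldLength(Integer.MAX_VALUE);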
Best
Erick
On 7/27/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote:
>
> Hi guys,
>
> I would like to know if exist some limit of size
I've built a production index with this patch and done some query stress
testing with no problems.
I'd give it a thumbs up.
Peter
On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
>
>
> Hi guys,
>
> Do you think LUCENE-843 is stable enough? If so, do you think it's worth
> to
> release it with probabl
You might want to search the mail archive for "facets" or "faceted search"
(no quotes), as I *think* this might be relevant.
Best
Erick
On 7/26/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
>
> Hi ,
> Of course this statement is very expensive.
> -->document.get("CAMPCATID")==null?"":document.get("
Hello,
> Hi everyone,
>
> I told you I'd be back with more questions! :-)
> Here is my situation. In my application, the field to be searched is
> selected via a drop-down box. I want my searches to basically
> be "contains"
> searches - I take what the user typed in, put a wildcard
> characte
Hi everyone,
I told you I'd be back with more questions! :-)
Here is my situation. In my application, the field to be searched is
selected via a drop-down box. I want my searches to basically be "contains"
searches - I take what the user typed in, put a wildcard character at the
beginning and end
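For reference, doing that directly (rather than through QueryParser, which
rejects a leading wildcard by default) might look like this sketch; the field
name and input are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // "contains" search: wildcards on both sides of the user's text.
    // A leading wildcard is expensive -- it has to enumerate many terms.
    Query q = new WildcardQuery(new Term("name", "*" + userText + "*"));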
Hi guys,
Do you think LUCENE-843 is stable enough? If so, do you think it's worth
releasing it, probably with Lucene 2.2.1? It would be nice so that people can
take advantage of it right away without risking other breaking changes
in the HEAD branch or waiting until the 2.3 release.
Thanks,
--
On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
There have also been a bunch of near-duplicate ideas that have been
presented on the forums before.
This is one of t
I believe Nutch has a duplicate detection algorithm. I don't know
how easy it would be to run independently on a Lucene index.
-Grant
On Jul 29, 2007, at 2:18 AM, Dmitry wrote:
We are trying to find any existing implementation for Lucene for detecting
duplicates in an index.
Assuming we have a set of doc
Hey Lukas,
I was being simplistic when I said that the text and TokenStream must be
exactly the same. It's difficult to think of a reason why you would not
want them to be the same though. Each Token records the offsets where it
can be found in the original text -- that is how the Highlighter k
: Where shall I post this issue.
you are currently posting to a list named "java-user" this is for "user"
related questions about the "java" lucene project.
if you have questions about "Lucene.Net" you should be asking them on the
"Lucene.Net" user list...
http://incubator.apache.org/lucene.net
A couple of thoughts here...
You could hash (e.g. MD5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in each hash bucket as
the non-dup document and then delete the other dups. This could be run as a
batch job to eliminate the duplicates in an off-line p
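A rough sketch of the hashing step (plain JDK, 1.4-style; fetching document
text and actually deleting the dups depends on your setup):

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    // Maps each content hash to the first doc seen; later ones are dups.
    public static Map findDuplicates(String[] contents) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Map firstSeen = new HashMap(); // hex digest -> Integer doc number
        for (int i = 0; i < contents.length; i++) {
            byte[] digest = md5.digest(contents[i].getBytes("UTF-8"));
            StringBuffer hex = new StringBuffer();
            for (int j = 0; j < digest.length; j++) {
                hex.append(Integer.toHexString((digest[j] & 0xff) | 0x100).substring(1));
            }
            Integer first = (Integer) firstSeen.get(hex.toString());
            if (first != null) {
                System.out.println("doc " + i + " duplicates doc " + first);
            } else {
                firstSeen.put(hex.toString(), new Integer(i));
            }
        }
        return firstSeen;
    }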