But I also see the importance of being able to skip score calculation.
Performance gain aside, is there any way to skip the scoring
calculation completely?
Jelda
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Yonik Seeley
> Sent: Wednesday,
: return !Character.isWhitespace(c);
: And my class override that method as this:
: return !((int)c==32);
in my opinion that's a pretty naive change ... it won't split on tab
characters or newlines ... even for trivial ASCII text that's probably not
what you want.
: I think the Charact
"Laxmilal Menaria" <[EMAIL PROTECTED]> wrote:
> > I am getting a Lock obtain timed out exception while searching the index.
> >
> > My steps: I created a Lucene index in the first week of May 2007; since
> > then I have changed nothing in the index folder, I am just searching. The searcher
> > code has only M
I've built a Lucene system that gets rapidly updated - documents are
supposed to be searchable immediately after they've been indexed.
As such I have a Writer that puts new index, update and delete tasks
into a queue and then has a thread which consumes them and applies them
to the index using
WITH_OFFSETS gives the equivalent of Token.startOffset and
Token.endOffset information which is the actual offset in the String
(although it can be manipulated), while WITH_POSITIONS gives the
position information (which can also be manipulated). Position info
tells where the token occurs
yes, I am getting the JVM crash exception in the logs.
#
# An unexpected error has been detected by Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 32756 bytes for ChunkPool::allocate.
Out of swap space?
#
# Internal Error (414C4C4F434154494F4E0E4350500065), pid=25596, tid=90152
Hi,
I indexed emails, and now I want to restrict the search functionality so
users can only search for emails sent to/from them.
I know the email address of the user, so my plan is to do it the following
way:
The user enters some search parameters, they are combined in a query.
This is a mi
Hi,
This sounds good. As for the code injection, it is up to you to sanitize
the request before it goes to Lucene, probably by filling in the email
field yourself and not relying on the user input for the email address
I hoped I wouldn't have to sanitize the user input, since the email address
query is ANDed
Hi Joe,
It might be possible when you append the restriction before parsing the
user query with the QueryParser, but I'm not sure. I recommend first
parsing the query, and then constructing a BooleanQuery with the parsed
user query and the e-mail term both as must. Another approach would be
to use
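A minimal sketch of that parse-then-combine approach; the field names
("body", "to") and the analyzer are illustrative assumptions, not from the
original mail:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RestrictedQueryBuilder {
    public static Query build(String userInput, String userEmail) throws Exception {
        // Parse only the user's search terms; the restriction is never
        // part of the parsed string, so it cannot be injected around.
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        Query userQuery = parser.parse(userInput);

        // Both the parsed query and the e-mail term must match.
        BooleanQuery combined = new BooleanQuery();
        combined.add(userQuery, BooleanClause.Occur.MUST);
        combined.add(new TermQuery(new Term("to", userEmail)), BooleanClause.Occur.MUST);
        return combined;
    }
}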
I know of no way to alter the Hits behavior, I recommend using
a TopDocs/TopDocCollector.
But be aware that if you load the document for each one, you may incur
a significant penalty, although the lazy-loading helped me a lot, see
FieldSelector.
On 5/23/07, Carlos Pita <[EMAIL PROTECTED]> wr
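For reference, a sketch of the TopDocs/TopDocCollector approach recommended
above; the hit count of 100 is an arbitrary assumption:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;

public class TopDocsExample {
    public static void printTopIds(IndexSearcher searcher, Query query) throws Exception {
        TopDocCollector collector = new TopDocCollector(100); // keep the top 100
        searcher.search(query, collector);
        TopDocs topDocs = collector.topDocs();
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            ScoreDoc sd = topDocs.scoreDocs[i];
            System.out.println("doc=" + sd.doc + " score=" + sd.score);
            // Avoid searcher.doc(sd.doc) here unless you really need the
            // stored fields; loading them per hit is where the penalty is.
        }
    }
}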
Hi Joe,
It would probably be cleaner to use a QueryFilter rather than doing the AND.
Take a look at
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/QueryFilter
.html
Also I'm not sure that using the sent to field will work - people may
receive email from a list, such as this, whe
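A sketch of the QueryFilter variant; the "to" field name is an assumption:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class EmailFilterSearch {
    public static Hits search(IndexSearcher searcher, Query userQuery, String email)
            throws Exception {
        // The filter restricts every search to the user's own mail; its
        // bitset is cached per reader, so repeat searches are cheap.
        QueryFilter ownMailOnly = new QueryFilter(new TermQuery(new Term("to", email)));
        return searcher.search(userQuery, ownMailOnly);
    }
}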
You can create two indexes. One will be for new documents, let say the
last 24 hours and another one for older documents. This way you will
only update a small portion of your index while the large index will
remain relatively constant so you don't have to get a new searcher for
it.
HTH
Aviran
ht
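A sketch of the two-index setup, searched together with a MultiSearcher;
the paths are made up for illustration:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class TwoIndexSearch {
    public static MultiSearcher open() throws Exception {
        // The small "recent" index is reopened often after updates; the
        // large "archive" index stays stable and keeps its searcher.
        IndexSearcher recent = new IndexSearcher("/indexes/recent");
        IndexSearcher archive = new IndexSearcher("/indexes/archive");
        return new MultiSearcher(new Searchable[] { recent, archive });
    }
}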
This sounds good. As for the code injection, it is up to you to sanitize
the request before it goes to Lucene, probably by filling in the email
field yourself and not relying on the user input for the email address.
HTH
Aviran
http://www.aviransplace.com
http://shaveh.co.il
-Original Message-
Another option would be to only re-open your searcher when actually
needed, that is after the index has changed. This only does you some
good when you have some hope that there are sizable gaps in
your modifications
Another possibility is to relax the "immediately" constraint. Would
a maximum
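A sketch of the reopen-only-when-changed idea, comparing index versions; it
ignores the real-world detail of other threads still holding the old
searcher when it is closed:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private final String path;
    private IndexReader reader;
    private IndexSearcher searcher;

    public SearcherHolder(String path) throws Exception {
        this.path = path;
        this.reader = IndexReader.open(path);
        this.searcher = new IndexSearcher(reader);
    }

    // Cheap check before each search: only reopen if the version moved.
    public synchronized IndexSearcher getSearcher() throws Exception {
        if (IndexReader.getCurrentVersion(path) != reader.getVersion()) {
            searcher.close();
            reader.close();
            reader = IndexReader.open(path);
            searcher = new IndexSearcher(reader);
        }
        return searcher;
    }
}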
Damien McCarthy wrote:
Hi Joe,
It would probably be cleaner to use a QueryFilter rather than doing the AND.
Take a look at
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/QueryFilter
.html
OK, if it's not too slow I'll go this way.
Also I'm not sure that using the sent to fiel
Hi,
Hi Joe,
It might be possible when you append the restriction before parsing the
user query with the QueryParser, but I'm not sure. I recommend first
parsing the query, and then constructing a BooleanQuery with the parsed
user query and the e-mail term both as must.
Yes, that's the idea.
An
On Thu, May 24, 2007 at 09:28:30AM -0400, Erick Erickson said:
> If that's unacceptable, you can *still* open up a new reader in the
> background and warm it up before using it. "immediately" then
> becomes 5-10 seconds or so.
This is currently what I'm doing, using a list of previously performed
qu
Hi,
On 5/24/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
If that's unacceptable, you can *still* open up a new reader in the
background and warm it up before using it. "immediately" then
becomes 5-10 seconds or so.
I've seen the term "warming" used a few times on the various lists.
What const
Yep. You probably want to do some sorting by other than relevancy
too in order to fill the sort caches.
Erick
On 5/24/07, Joe Shaw <[EMAIL PROTECTED]> wrote:
Hi,
On 5/24/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> If that's unacceptable, you can *still* open up a new reader in the
> b
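A sketch of what "warming" might look like in code; the queries and the
sort field are placeholders you would replace with ones representative of
your real traffic:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;

public class Warmer {
    public static IndexSearcher openWarmed(String path) throws Exception {
        IndexSearcher fresh = new IndexSearcher(path);
        // Prime the OS file cache and Lucene's internals before real
        // traffic hits the new searcher.
        fresh.search(new TermQuery(new Term("body", "lucene")));
        // Sorting by a field forces that field's FieldCache to load,
        // usually the most expensive part of the first search.
        fresh.search(new TermQuery(new Term("body", "index")), new Sort("store_id"));
        return fresh; // now swap it in and retire the old searcher
    }
}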
Hi Erick,
thank you for your prompt answer. What do you mean by loading the document?
Accessing one of the stored fields? In that case I'm afraid I would need to
do it. For example, in the aforementioned case of a result set of products, I
have to look at each product's store_id, which is stored along t
Hi all!
I implemented a searcher with Lucene and I'm trying to search for two words,
both in the same text file, but... I can't!
When I search for the first word and the second separately, everything works
OK, but when together, with or without "AND" or "+"... nothing is found! :(
Can somebody h
On 5/24/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
But I also see the importance of being able to skip score calculation.
Performance gain aside, is there any way to skip the scoring
calculation completely?
Yes, for unsorted results use a hit collector and no sorting will be
done by sco
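A sketch of the hit collector idea: the collector records matches in index
order, so no sorting by score ever happens (the scores handed to collect()
are simply ignored):

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class UnsortedCollector {
    public static BitSet collect(IndexSearcher searcher, Query query) throws Exception {
        final BitSet matches = new BitSet();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                matches.set(doc); // the score argument is ignored
            }
        });
        return matches;
    }
}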
Hi,
I'm trying to figure out what I need to do with Lucene to score a
document higher when it has a larger number of unique search terms
that are hit, rather than by term frequency counts.
A quick example.
If I'm searching for "BIRD CAT DOG" (all should clauses), then I want
...a document with B
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 5/24/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> > But I also see the importance of being able to skip score calculation.
> >
> > Performance gain aside, is there any way to skip the scoring
> > calculation completely?
>
> Yes, for unsorted r
Hello users,
I am currently developing an algorithm to calculate the shortest snippet
from the search results for a given keyword of length n (from the user query).
From the Lucene source I found that there is a method getBestFragments
which would do the same. However it's very hard to interpret
Hi, thanks for the help!
Yes, along the lines you mentioned we can reduce the amount
of calculation, but we still need to loop through to count
all docs, so the time may still be O(n). I am wondering if we
can avoid the loop and get the count directly?
Best regards, Lisheng
-Original Message-
From: M
You're on the right track. But that said, access to anything that's
indexed (stored or not) should be pretty quick. Things that are
stored, but not indexed, are costlier. This might drive your
decision on what to index vs. store.
Loading the document is anything like IndexReader.document(), or
Hits.d
Not until you give us more information.
In particular, what analyzers you use at index and search time.
What the string was originally and how you indexed it.
What query.toString() shows you.
Best
Erick
On 5/24/07, Rodrigo F Valverde <[EMAIL PROTECTED]> wrote:
Hi all!
I implemented a search
"Zhang, Lisheng" <[EMAIL PROTECTED]> wrote:
> Hi, thanks for the help!
>
> Yes, along the lines you mentioned we can reduce the amount
> of calculation, but we still need to loop through to count
> all docs, so the time may still be O(n). I am wondering if we
> can avoid the loop and get the count directly?
Hi all,
Is there any guarantee that the maxDoc returned by a reader will be about the
total number of indexed documents?
The motivation of this question is that I want to associate some info to
each document in the index, and in order to access this additional data in
O(1) I would like to do this
Hi Erick,
I don't think that FieldSelector would be that valuable in my case because I
just need to access a few fields, and those are all fields that are in fact
stored (and indexed too). I was thinking of keeping this extra information
in memory, specifically in an array mapping doc ids to the d
Hi,
I found the problem. The version of Lucene on server is 2.1 while on
client is 1.9.
Thanks
On Wed, 2007-05-23 at 13:52 -0600, Su.Cheng wrote:
> Hi,
> I studied "5.6 Searching across multiple Lucene indexes 178" in <<Lucene in Action>>.
>
> I have 2 remote search computers (SearchServer) working as
Hi all,
I have an ID field which I index using the KeywordAnalyzer. Since this
analyzer tokenizes the entire stream as a single token, would you say the
end result is the same as using any analyzer and specifying this ID field as
untokenized? The latter approach does not use the analyzer so would
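For comparison, a sketch of both setups; the writer construction and field
names are illustrative assumptions:

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IdFieldExample {
    public static void addBoth(String dir, String id) throws Exception {
        // Approach 1: an analyzed field, but KeywordAnalyzer emits the
        // whole value as a single token for the "id" field.
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("id", new KeywordAnalyzer());
        IndexWriter writer = new IndexWriter(dir, analyzer, true);

        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.TOKENIZED));
        // Approach 2: skip analysis entirely; the indexed term is the same.
        doc.add(new Field("id2", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}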
See below...
On 5/24/07, Carlos Pita <[EMAIL PROTECTED]> wrote:
Hi all,
Is there any guarantee that the maxDoc returned by a reader will be about the
total number of indexed documents?
No. It will always be at least as large as the total documents. But that
will also count deleted documents
Hello,
Currently we are attempting to optimize the search time against an index
that is 26 GB in size (~35 million docs) and I was wondering what
experiences others have had in similar attempts. Simple searches
against the index are still fast even at 26GB, but the problem is our
application
I will try to summarize the code:
INDEX TIME
- IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
- writer.setUseCompoundFile(false);
- while there are files in the given dir...
- Document doc = new Document();
- doc.add(new Field("content", new FileReader(file)));
- doc.add(ne
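A runnable sketch of the steps summarized above; the truncated lines are
filled in with plausible assumptions (e.g. a stored "path" field):

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer {
    public static void index(String indexDir, File dataDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);
        File[] files = dataDir.listFiles();
        for (int i = 0; i < files.length; i++) {
            Document doc = new Document();
            // Field(String, Reader) is indexed and tokenized but NOT stored.
            doc.add(new Field("content", new FileReader(files[i])));
            doc.add(new Field("path", files[i].getPath(),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}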
Why wouldn't numDocs serve?
Because the document id (which is the array index) would be in the range 0
... maxDoc and not 0 ... numDocs, wouldn't it?
Cheers,
Carlos
Best
Erick
The motivation of this question is that I want to associate some info to
> each document in the index, and in ord
Have a look at the DisjunctionMaxQuery, I think it might help,
although I am not sure it will fully cover your case.
-Grant
On May 24, 2007, at 11:22 AM, Walt Stoneburner wrote:
Hi,
I'm trying to figure out what I need to do with Lucene to score a
document higher when it has a larger number of
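A sketch of Grant's DisjunctionMaxQuery suggestion for the BIRD CAT DOG
example; with a tie-breaker of 0 a document's score comes from its single
best clause, so repeated hits on one term don't pile up (though, as he says,
this may not fully cover the unique-terms case):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DisMaxExample {
    public static Query build() {
        // 0.0f tie-breaker: only the best-scoring clause counts.
        DisjunctionMaxQuery q = new DisjunctionMaxQuery(0.0f);
        q.add(new TermQuery(new Term("body", "bird")));
        q.add(new TermQuery(new Term("body", "cat")));
        q.add(new TermQuery(new Term("body", "dog")));
        return q;
    }
}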
No. It will always be at least as large as the total documents. But that
will also count deleted documents.
Do you mean that deleted document ids won't be reused, so the index
maxDoc will grow more and more over time? Isn't there any way to compress
the range? It seems strange to me, con
Erick,
I was pursuing a different direction yesterday which is not fast enough.
Basically I was using the highlighter to figure out if a page has a hit
or not. But that is too expensive. I end up with 15 ms per page, and
that adds up.
I have to allow ad-hoc queries, so it sounds like the solution
Scott,
Yes, take your big index and split it into multiple smaller shards. Put those
shards on different servers and then query them remotely (using the provided
RMI support in Lucene or something custom), take the top N results from each
searcher, merge those, and take the top N from the merged r
Terry,
I think you are right. Just use UN_TOKENIZED, that will do what you need.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: dontspamterry <[EMAIL PROTECTED]>
To: java-user@lucene.
Carlos,
It sounds like you'll have to build logic that knows when the index has been
reopened and repopulates your cache. Take a look at Solr, it does this type of
stuff.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Shar
Carlos:
Answer to your last question: no, but if you look in JIRA, Karl Wettin has
written something that does have the notification mechanism you are
describing.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
--
Hi Otis,
I tried both ways, did some queries, and results are the same, so I guess
it's a matter of preference???
-Terry
Otis Gospodnetic wrote:
>
> Terry,
> I think you are right. Just use UN_TOKENIZED, that will do what you need.
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . .
Hi again!
That's all different now!
I'm no longer using "reader.search()"... now I'm using the QueryParser:
- QueryParser qp = new QueryParser("content", new StandardAnalyzer());
- query = qp.parse(keyWordToSearch);
Now it works fine! :D
But now I need to know the difference between them! :)
T
Hi Terry,
The one place I know where KeywordAnalyzer is definitely useful is when
it is used in conjunction with PerFieldAnalyzerWrapper.
Steve
dontspamterry wrote:
> Hi Otis,
>
> I tried both ways, did some queries, and results are the same, so I guess
> it's a matter of preference???
>
> -Te
: just need to access a few fields, and those are all fields that are in fact
: stored (and indexed too). I was thinking of keeping this extra information
: in memory, specifically in an array mapping doc ids to the data structure. I
if the fields you need are indexed and single valued (and untoken
Well, my data may not be too helpful. But some of the books I'm
counting hits for are a thousand-plus pages. We haven't had
performance issues, but that's only saying "no customer has
complained yet".
The old solution we used did something similar to what you're
talking about, basically streaming
Document IDs will be reused after, say, optimization.
One consequence of this is that optimization will change the IDs
of *existing* documents.
You're right, numDocs may well be smaller than maxDoc.
That's what I get for reading quickly...
Best
Erick
On 5/24/07, Carlos Pita <[EMAIL
Hi Scott,
I met the same situation as you (indexing 100M documents). If the computer
has only one CPU and one disk, ParallelMultiSearcher is slower than
MultiSearcher.
I wrote an email "Who has sample code of remote multiple servers
multiple indexes searching" yesterday. If you have any suggestion,
That's no problem, I can regenerate my entire extra data structure upon
periodic index optimization. That way the array size will be about the size
of the index. What I find more difficult is to know the id of the last
added/removed document. I need it to update the in-mem structure upon more
fine-grained index changes. Any ideas?
If you haven't, I *strongly* recommend you get a copy of Luke.
Google "lucene luke" to find it. It allows you to examine your
index and also to see how queries parse. It's invaluable.
I can't say exactly what the difference is, but there are
several possibilities. Note that in general it's best
From the Javadoc for IndexReader.
Returns one greater than the largest possible document number. This may be
used to, e.g., determine how big to allocate an array which will have an
element for every document number in an index.
Isn't that what you're wondering about?
Erick
On 5/24/07, Ca
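The allocation pattern that Javadoc describes, as a tiny sketch; what goes
in the array is up to the application, and deleted ids just leave unused
slots:

import org.apache.lucene.index.IndexReader;

public class PerDocArray {
    public static String[] allocate(IndexReader reader) {
        String[] perDoc = new String[reader.maxDoc()];
        for (int id = 0; id < perDoc.length; id++) {
            if (!reader.isDeleted(id)) {
                perDoc[id] = ""; // populate from your own data source
            }
        }
        return perDoc;
    }
}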
On 5/24/07, Carlos Pita <[EMAIL PROTECTED]> wrote:
I need it to update the in-mem structure upon more
fine-grained index changes. Any ideas?
Currently, a deleted doc is removed when the segment containing it is
involved in a segment merge. A merge could be triggered on any
addDocument(), mak
Yes Erick, that's fine. But the fact is that I'm not sure whether the next
added document will have an id equal to maxDoc. If this is guaranteed, then
I will update the maxDoc slot of my extra data structure upon document
addition and get rid of the hits.id(0) slot upon document deletion. Then,
On 5/24/07, Carlos Pita <[EMAIL PROTECTED]> wrote:
Yes Erick, that's fine. But the fact is that I'm not sure whether the next
added document will have an id equal to maxDoc.
Yes. The highest docId will always be the last document added, and
docIds are never re-arranged with respect to each ot
Yes, I already have Luke! :)
The words I used were: "maria" and "amanda".
The first word is in one text file, and the second is in that same file and
another one (so, two files).
Replacing "IndexSearcher.search()" with "QueryParser.parse()" and keeping
everything else the same, all works fine.
By Luke and by
I have done some benchmarks. Keeping things in an array makes the entire
search, including postprocessing from first to last id for a big result set,
extremely fast. So I would really like to implement this approach. But I'm
concerned about what Yonik remarked. I could use a large mergeFactor but
: extremely fast. So I would really like to implement this approach. But I'm
: concerned about what Yonik remarked. I could use a large mergeFactor but
: anyway, just to be sure, is there a way to make the index inform my
: application of merging events?
this entire thread seems to be a discussio
Mh, some of my fields are in fact multi-valued. But anyway, I could store
them as a single string and split after retrieval.
Will FieldCache work for the first search with some query or just for the
successive ones, for which the fields are already cached?
Cheers,
Carlos
On 5/24/07, Chris Hoste
: Mh, some of my fields are in fact multi-valued. But anyway, I could store
: them as a single string and split after retrieval.
: Will FieldCache work for the first search with some query or just for the
: successive ones, for which the fields are already cached?
The first time you access the ca
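A sketch of the FieldCache access under discussion, assuming a
single-valued, indexed, untokenized field named "store_id":

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class StoreIdCache {
    public static String[] load(IndexReader reader) throws Exception {
        // The first call for a given reader walks the term index and
        // builds the array (the slow part); later calls return the
        // cached array immediately.
        return FieldCache.DEFAULT.getStrings(reader, "store_id");
    }
}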
Hi Su,
I came across some discussion of ParallelMultiSearcher and RMI in
chapter 5 of the book Lucene in Action. There are a couple of examples
in there, so might be a good place to start.
-Scott
-Original Message-
From: Su.Cheng [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 24, 2007
Nice, I will write the ids into a byte array with a DataOutputStream and
then marshal that array into a String with a UTF-8 encoding. This way there
is no need for parsing or splitting, and the encoding is space efficient.
This marshaled String will be cached with a FieldCache. Thank you for your
s
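A sketch of the packing step, with one caveat: arbitrary bytes do not
survive a UTF-8 round trip, so this sketch swaps in ISO-8859-1, which maps
every byte value to a character losslessly:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class IdPacking {
    public static String pack(int[] ids) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(ids.length);
        for (int i = 0; i < ids.length; i++) {
            out.writeInt(ids[i]);
        }
        out.flush();
        // One char per byte; reverse with String.getBytes("ISO-8859-1").
        return bytes.toString("ISO-8859-1");
    }
}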
Su.Cheng wrote:
Hi Scott,
I met the same situation as you (indexing 100M documents). If the computer
has only one CPU and one disk, ParallelMultiSearcher is slower than
MultiSearcher.
I wrote an email "Who has sample code of remote multiple servers
multiple indexes searching" yesterday. If you ha
I have an application that indexes new data into one index directory,
and other applications that read the index for data mining.
But my mining application must re-open the index directory. The index is
5 GB, and the mining must happen in real time.
How can I do it on many computers at one n
Hi,
My understanding is that once you have added documents to your index, you
need to close and reopen your IndexReader and Searcher; otherwise the
added documents will not be visible to them.
You might want to try LuceneIndexAccessor
(http://www.blizzy.de/lucene/lucene-indexaccess-0.1.0.zip) w
Carlos Pita wrote:
Hi all,
Is there any guarantee that the maxDoc returned by a reader will be about the
total number of indexed documents?
What struck me in this thread was that there may be a misunderstanding of the
relationship between numDocs/maxDoc and an IndexReader.
When an IndexReade
I see. Anyway I would update the array when adding a document, so my reader
would be closed then, and just a writer would be accessing the index.
Supposing that no merging is triggered (for this I'm choosing a big
mergeFactor and forcing optimization when a number of documents has been
added) the