Hi,
I am trying to compute the counts of terms of the documents returned
by running a query using a TermVectorMapper.
I was wondering if anyone knew if there was a faster way to do this
rather than using a HashMap with a TermVectorMapper to store the
counts of the terms and calling getTermF
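For reference, the HashMap counting described above reduces to something like this plain-Java sketch (no Lucene dependency; `countTerms` and the sample terms are illustrative names, standing in for what a TermVectorMapper would hand you per document):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: accumulate term frequencies for one document's terms,
// e.g. as delivered term-by-term to a TermVectorMapper callback.
public class TermCounts {
    // Increment a counter per term; getOrDefault keeps the loop branch-free.
    public static Map<String, Integer> countTerms(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.put(term, counts.getOrDefault(term, 0) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = countTerms(List.of("lucene", "index", "lucene"));
        System.out.println(c.get("lucene")); // prints 2
    }
}
```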
Erick,
Thank you. This is awesome. I got it to work by just setting slop to 1
and returning 10 in my analyzer.getPositionIncrementGap. Here are my
tests in case anyone else is interested:
public class TestPositionIncrementGap extends TestCase {
Analyzer analyzer = new Keyword
Good point on isCurrent - I think it should only be with respect to
the latest index commit point, and we should clarify that in the
javadoc.
[...]
> // but what does the nrtReader say?
> // it does not have access to the most recent commit
> // state, as there's been a commit (with documents)
> /
Ok, thanks for the details. I see I'm not the only one finding the javadoc
hard to understand. While this is well documented, it's still not clear
enough about the exact semantics of "changes": at first I thought it
returned an IndexReader on the *uncommitted changes only*, which meant it did
not
I still see some things we might want to document or explain:
We still need to be careful what the call to "isCurrent()"
will mean in the future for IndexReaders - as now there is another
kind of "current" - "current even up to uncommitted changes".
Imagine the following set of IndexReaders float
Hi Paul,
Thanks for your suggestion. I will test it within the next few days.
However, due to memory limitations, it will only work if the number of hits
is small enough, am I right?
Chris
2009/10/12 Paul Elschot
> Chris,
>
> You could also store term vectors for all docs at indexing
> time, a
I think it was my email Yonik responded to and he is right, I was being lazy
and didn't read the javadoc very carefully. My bad.
Thanks for the javadoc change.
-John
On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote:
> On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix
> wrote:
> > It may be surpri
On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote:
> On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix
> wrote:
> > It may be surprising, but in fact I have read that
> > javadoc.
>
> It was not your email I responded to.
>
Sorry, my bad then - you said "guys" and John and I were the last two to b
OK I just committed it -- thanks!
Mike
On Mon, Oct 12, 2009 at 5:01 PM, Jake Mannix wrote:
> That seems a lot more straightforward Mike, thanks.
>
> -jake
>
> On Mon, Oct 12, 2009 at 1:56 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> I agree, the javadocs could be improved.
That seems a lot more straightforward Mike, thanks.
-jake
On Mon, Oct 12, 2009 at 1:56 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> I agree, the javadocs could be improved. How about something like
> this for the first 2 paragraphs:
>
> * Returns a readonly reader, covering
On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix wrote:
> It may be surprising, but in fact I have read that
> javadoc.
It was not your email I responded to.
> It talks about not needing to close the
> writer, but doesn't specifically talk about the what
> the relationship between commit() calls a
I agree, the javadocs could be improved. How about something like
this for the first 2 paragraphs:
* Returns a readonly reader, covering all committed as
* well as un-committed changes to the index. This
* provides "near real-time" searching, in that changes
* made during an IndexWri
Chris,
You could also store term vectors for all docs at indexing
time, and add the termvectors for the matching docs into a
(large) map of terms in RAM.
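Paul's aggregation step, merging the stored term vectors of all matching docs into one map in RAM, could be sketched like this in plain Java (per-document vectors are stood in by simple maps; the names are illustrative, not Lucene API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: merge per-document term-frequency maps (as read from stored
// term vectors) into one aggregate map over all matching documents.
public class TermVectorMerge {
    public static Map<String, Integer> merge(List<Map<String, Integer>> perDoc) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> vector : perDoc) {
            for (Map.Entry<String, Integer> e : vector.entrySet()) {
                // Sum frequencies for terms seen in more than one document.
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = Map.of("lucene", 2, "index", 1);
        Map<String, Integer> doc2 = Map.of("lucene", 1);
        System.out.println(merge(List.of(doc1, doc2)).get("lucene")); // prints 3
    }
}
```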
Regards,
Paul Elschot
On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> Hi Jake,
>
> Thanks for your helpful explanation.
> In fac
Hi Cedric,
There is a wiki page on NRT at:
http://wiki.apache.org/lucene-java/NearRealtimeSearch
Feel free to ask questions if there's not enough information.
-J
On Mon, Oct 12, 2009 at 2:24 AM, melix wrote:
>
> Hi,
>
> I'm going to replace an old reader/writer synchronization mechanism we had
Not quite. Starting with the second add, a call will be made to
getPositionIncrementGap in your analyzer. If you return a number
larger than one, then the position gap between the last term of the preceding
add and the first term of this add will be that number. If you do nothing
with getPositionI
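To make the effect concrete, here is a plain-Java model of how token positions are assigned across repeated adds of the same field (a simplified model of what Lucene does with the gap, assuming the gap is the distance between the last token of one value and the first token of the next; the gap value 10 and sample values are just illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Model: assign token positions across multiple values of one field,
// inserting a position gap between values, the way
// Analyzer.getPositionIncrementGap is meant to.
public class PositionGapDemo {
    public static List<Integer> positions(List<String> values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        for (String value : values) {
            boolean firstTokenOfValue = true;
            for (String token : value.split("\\s+")) {
                if (out.isEmpty()) {
                    pos = 0;          // very first token of the field
                } else if (firstTokenOfValue) {
                    pos += gap;       // jump between successive field values
                } else {
                    pos += 1;         // adjacent tokens within one value
                }
                out.add(pos);
                firstTokenOfValue = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With a gap of 10, "aaa" (pos 1) and "value2" (pos 11) are far
        // apart, so a small-slop PhraseQuery cannot match across values.
        System.out.println(positions(List.of("value1 aaa", "value2 bbb"), 10));
        // prints [0, 1, 11, 12]
    }
}
```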
Thanks Yonik,
It may be surprising, but in fact I have read that
javadoc. It talks about not needing to close the
writer, but doesn't specifically talk about the what
the relationship between commit() calls and
getReader() calls is. I suppose I should have
interpreted:
"@returns a new reader
I need to analyze these values since I also want the benefits of the
porterStemmer. The problem with using PhraseQuery is that I don't
always know the slop. I may have values like "value4 ddd aaa". It's a
tricky problem because I think Lucene sees all these values as one long
value for the field "optio
Guys, please - you're not new at this... this is what JavaDoc is for:
/**
* Returns a readonly reader containing all
* current updates. Flush is called automatically. This
* provides "near real-time" searching, in that changes
* made during an IndexWriter session can be made
* a
Or else just make sure that you use PhraseQuery to hit this field when you
want "value1 aaa". If you don't tokenize these pairs, then you will have to
do prefix/wildcard matching to hit just "value1" by itself (if this is
allowed
by your business logic).
-jake
On Mon, Oct 12, 2009 at 1:21 PM,
Hi Eric,
To achieve what you want, do not tokenize the values you query/add to this
field.
On Mon, Oct 12, 2009 at 4:05 PM, Angel, Eric wrote:
> I have documents that store multiple values in some fields (using the
> document.add(new Field()) with the same field name). Here's what a
> typical
Oh, that is really good to know!
Is this deterministic? E.g., as long as writer.addDocument() is called, does
the next getReader reflect the change? Does it work with deletes, e.g.
writer.deleteDocuments()?
Thanks Mike for clarifying!
-John
On Mon, Oct 12, 2009 at 12:11 PM, Michael McCandless <
luc...@mik
I have documents that store multiple values in some fields (using the
document.add(new Field()) with the same field name). Here's what a
typical document looks like:
doc.option="value1 aaa"
doc.option="value2 bbb"
doc.option="value3 ccc"
I want my queries to only match individual values,
Wow! This is awesome. Can't wait to see how it plays with Bobo :)
On Sun, Oct 11, 2009 at 10:19 PM, John Wang wrote:
> Hi guys:
> The new FieldComparator api looks really scary :)
>
> But after some perf testing with numbers I'd like to share, I guess it
> is worth it:
>
> HW: Mac Pro with
On Mon, Oct 12, 2009 at 12:26 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix
> wrote:
>
> > Wait, so according to the javadocs, the IndexReader which you got from
> > the IndexWriter forwards calls to reopen() back to
> IndexWriter.getRea
Hi Jake,
Thanks for your helpful explanation.
In fact, my initial solution was to traverse each document in the result
once and count the contained terms. As you mentioned, this process took a
lot of memory.
Trying to confine the memory usage with the facet approach, I was surprised
by the decline
On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix wrote:
> Wait, so according to the javadocs, the IndexReader which you got from
> the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(),
> which means that if the user has a NRT reader, and the user keeps calling
> reopen() on it,
Wait, so according to the javadocs, the IndexReader which you got from
the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(),
which means that if the user has a NRT reader, and the user keeps calling
reopen() on it, they're getting uncommitted changes as well, while if they
cal
Just to clarify: IndexWriter.getReader returns a reader that searches
uncommitted changes as well. I.e., you need not call IndexWriter.commit
to make the changes visible.
However, if you're opening a reader the "normal" way
(IndexReader.open) then it is necessary to first call
IndexWriter.commit.
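A minimal sketch of the two paths, assuming the Lucene 2.9 API (`dir`, `analyzer`, and `doc` stand in for an existing Directory, Analyzer, and Document; illustrative fragment, not runnable as-is):

```java
// Near-real-time path: the reader sees uncommitted changes, no commit needed.
IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.addDocument(doc);
IndexReader nrtReader = writer.getReader();      // reflects the uncommitted add

// "Normal" path: a reader opened from the Directory only sees committed state.
writer.commit();                                 // required before IndexReader.open
IndexReader committedReader = IndexReader.open(dir, true);
```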
The source code attachment got somehow lost:
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apa
Hello Paul,
I implemented what you wanted in the attached test case. It works without
problems. Your error was that, when creating the TermQuery, you passed the
precisionStep as the shift value parameter, which is incorrect.
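A sketch of the intended usage, assuming Lucene 2.9 and a hypothetical int field "price" (matching a single numeric value is a range query with equal bounds):

```java
// Assuming the field was indexed with precisionStep 4, e.g.:
//   doc.add(new NumericField("price", 4, Field.Store.NO, true).setIntValue(42));
// then an exact match is a range with equal lower and upper bounds:
Query q = NumericRangeQuery.newIntRange("price", 4, 42, 42, true, true);
```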
By the way: Lucene 2.9.1 and Lucene 3.0 will be optimized for ranges like [1
TO 1],
Hey Chris,
On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
christoph.bo...@googlemail.com> wrote:
> Thanks for your reply.
> Yes, it's likely that many terms occur in few documents.
>
> If I understand you right, I should do the following:
> -Write a HitCollector that simply increments a coun
Thanks for your reply.
Yes, it's likely that many terms occur in few documents.
If I understand you right, I should do the following:
-Write a HitCollector that simply increments a counter
-Get the filter for the user query once: new CachingWrapperFilter(new
QueryWrapperFilter(userQuery));
-Create
Hi Cedric,
I don't know of anyone with a substantial throughput production system who
is doing realtime search with the 2.9 improvements yet (and in fact, no
serious performance analysis has been done on these even "in the lab" so to
speak: follow https://issues.apache.org/jira/browse/LUCENE-157
Thanks a lot. I think TermPositionVector will solve my problem,
although it seems to be a little inefficient.
Concerning the term representation: our data is way more complex than
just phrasal annotation; it was just an example, because I am not
allowed to talk about our internal organisation. I
Given you have 1M docs and about 1M terms, do you see very few docs per
term?
If your DocSet per term is very sparse, a BitSet is probably not a good
representation. A simple int array may be better for memory, and faster
for iterating.
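The memory argument is easy to make concrete (plain Java; back-of-envelope sizes, not measured heap footprints):

```java
// Back-of-envelope: representing 10 matching docs out of 1,000,000.
public class SparseDocSet {
    // A bit set needs one bit per document in the index,
    // no matter how few bits are actually set.
    public static long bitSetBytes(int maxDoc) {
        return maxDoc / 8;
    }

    // A sorted int array needs one 4-byte int per matching doc.
    public static long intArrayBytes(int numHits) {
        return 4L * numHits;
    }

    public static void main(String[] args) {
        System.out.println(bitSetBytes(1_000_000)); // prints 125000
        System.out.println(intArrayBytes(10));      // prints 40
    }
}
```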
-John
On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot wrote:
> On
Uwe Schindler wrote:
Can you print the upper and lower term, or the term you received in
newRangeQuery and newTermQuery, to System.out as well? Maybe it is somehow
converted by your Analyzer, which is used for parsing the query.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.theta
Can you print the upper and lower term, or the term you received in
newRangeQuery and newTermQuery, to System.out as well? Maybe it is somehow
converted by your Analyzer, which is used for parsing the query.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thet
On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> Hi,
>
> I have a question related to faceted search. My index contains more than 1
> million documents, and nearly 1 million terms. My aim is to get a DocIdSet
> for each term occurring in the result of a query. I use the approach
> descr
Hi,
I have a question related to faceted search. My index contains more than 1
million documents, and nearly 1 million terms. My aim is to get a DocIdSet
for each term occurring in the result of a query. I use the approach
described on
http://sujitpal.blogspot.com/2007/04/lucene-search-within-sear
nitingupta183 wrote:
> Hi all,
>
> I am supposed to add a feature in which my app will detect the duplicate
> contacts of a user on the basis of their name, email, mobile number,
> etc. (i.e. a Contacts Duplicate Killer kind of feature). The simplest al
Hi all,
I am supposed to add a feature in which my app will detect the duplicate
contacts of a user on the basis of their name, email, mobile number,
etc. (i.e. a Contacts Duplicate Killer kind of feature). The simplest algo I
can think of is to find all the contacts on the basis of their name, email an
You are storing this field without analysis, correctly, as you want
exact matches only, but you are using StandardAnalyzer at query time. Use
PerFieldAnalyzerWrapper, specifying KeywordAnalyzer for this field.
Using MultiFieldQueryParser may not make much sense here.
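A sketch of that setup, assuming the Lucene 2.9 API (the default field name is illustrative; fragment, not runnable as-is):

```java
// Analyze most fields with StandardAnalyzer, but keep TransType verbatim.
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
analyzer.addAnalyzer("TransType", new KeywordAnalyzer());

// Use the same wrapper at query time so the term is not tokenized.
QueryParser parser = new QueryParser(Version.LUCENE_29, "TransType", analyzer);
```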
--
Ian.
On Mon, Oct 12, 2009 at 11
Uwe Schindler wrote:
I forgot: the format of numeric fields is also not plain text; because of
this, a simple TermQuery as generated by your query parser will not work
either.
If you want to hit numeric values without a NumericRangeQuery with lower and
upper bound equal, you have to use NumericUtil
Hi,
I am using StandardAnalyzer for indexing as well as searching the
indexes. But my search doesn't work correctly with special characters. I am
storing some special characters in a field called TransType, i.e.
document.add(new Field("TransType", "db92fb60-b716-11de-8718-001a4bc7d46e",
Field
Hi,
I'm going to replace an old reader/writer synchronization mechanism we had
implemented with the new near realtime search facilities in Lucene 2.9.
However, it's still a bit unclear how to do it efficiently.
Is the following implementation a good way to achieve it? The context
is con