DuplicateFilter.java

2009-08-05 Thread Paul
ucene/search/DuplicateFilter.html [2] http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_4/src/java/org/apache/lucene/search/ Thanks, Paul.

Beginner's questions

2013-03-26 Thread Paul
t as coherent as I can make them. Thank you. -Paul - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Storing Documents in Lucene

2013-03-28 Thread Paul
out the indexing and less about storing documents. Thank you. -Paul

Re: How to use concurrency efficiently

2013-04-02 Thread Paul
something about the abstract class MultiTermQuery, but I don't really understand whether or not it would help with this problem. Thank you. -Paul

Lucene commit

2016-08-21 Thread Paul Masurel
hable after another one even though it was added before. The benefit would be to reduce the average latency for a document to become searchable, without hurting throughput by calling commit() too frequently. Regards, Paul

Re: Lucene commit

2016-08-22 Thread Paul Masurel
Awesome! Thank you very much! On Mon, Aug 22, 2016 at 3:45 PM, Christoph Kaser wrote: > Hello Paul, > > this is already possible using > DirectoryReader.openIfChanged(indexReader,indexWriter). > This will give you an indexreader that already "sees" all changes ma

Re: Environmental Protection Agency: Stop Deforesting in Sri Lanka

2019-03-21 Thread Noble Paul
re and sign the petition here: > > > > http://chng.it/vY78rzGf8G > > > > Thanks! > > Janaka

CVE-2018-11802: Apache Solr authorization bug vulnerability disclosure

2019-04-24 Thread Noble Paul
CVE-2018-11802: Apache Solr authorization bug disclosure Severity: Important Vendor: The Apache Software Foundation Versions Affected: Apache Solr 7.6 or less Description: Jira ticket: https://issues.apache.org/jira/browse/SOLR-12514 In Apache Solr the cluster can be partitioned into multiple co

Re: [VOTE] Lucene logo contest, here we go again

2020-09-01 Thread Noble Paul
tps://issues.apache.org/jira/browse/LUCENE-9221 > [first-vote] > http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e > [rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting > > -- - Noble Paul

Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-02 Thread Noble Paul
e_logo_green_300.png > >> > >> Please vote for one of the above choices. This vote will close about one > >> week from today, Mon, Sept 7, 2020 at 11:59PM. > >> > >> Thanks! > >> > >> [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221 > >> [first-vote] > >> http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e > >> [second-vote] > >> http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e > >> [rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > -- - Noble Paul

Re: Using Lucene for technical documentation

2020-11-23 Thread Paul Libbrecht
parametrisation). But I’d be happy to read of others’ works on this! In the Math working group of W3C at the time, work stopped when considering the complexity of compound documents: the alternatives as above (mix words or recognise math pieces?) certainly made things difficult. paul PS: [paper for

[ANNOUNCE] Apache Lucene 8.8.0 released

2021-02-01 Thread Noble Paul
you are using may not have replicated the release yet. If that is the case, please try another mirror. This also applies to Maven access. - - Noble Paul

Re: Document metadata in ranking?

2021-02-25 Thread Paul Libbrecht
that the influence of a positive category takes precedence over the different orderings (TF-IDF per default). In the end you can write a custom score engine but I can only imagine ruining the performance when doing so... paul On 26 Feb 2021, at 3:40, Philip Warner wrote: I am sorry if this has

Re: Search results/criteria validation

2021-03-17 Thread Paul Libbrecht
queries and in the score’s fineness, it was indicating which sub-query was used. This was used to attempt highlighting the matching parts of a formula. Paul On 17 Mar 2021, at 20:24, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote: Maybe using explain? https://chrisperks.co/2017/06/06/explaining

Use of In-like query and performance implications

2005-03-02 Thread Paul Smith
. my question is, are there any performance concerns here if ("...In(g,h,i,j,) ") starts getting longer and longer? Can Lucene handle this in an optimal manner, without a serious scalability issue ? (memory/cpu/io etc). Or would it be better that a different design is used for th

Find version of Lucene library

2005-03-08 Thread Paul Mellor
ven't done anything stupid like replaced the wrong JAR! Thanks Paul

Re: Simple Search Question.

2005-03-14 Thread Paul Elschot
> get all the unique values for that, how would I go about it? The normal way is to use IndexReader.terms(Term), passing it a Term constructed from the field name and an empty string, and stopping on the returned TermEnum when the field n

Re: Alert function (aka "profiled alerting")

2005-03-16 Thread Paul Elschot
he alert queries on that index, and then merge the new and the old index, possibly deleting old documents in the old index just before merging. - create a filter for the new documents in the updated index and use that in a FilteredQuery for each alert on the new index. For very large indexes

Re: Alert function (aka "profiled alerting")

2005-03-17 Thread Paul Elschot
y minute, > and then destrory the RAMDirectory. Good news pure Lucene and easy to > code, bad news it has the dreaded loop thru all 10k queries. If the order of the queries can be chosen by the implementation, the speed of

Re: boosting?

2005-03-21 Thread Paul Elschot
floating point values. encodeNorm() rounds to a representable value close to the given float, and decodeNorm() returns that representable value, normally used in TermScorer. Regards, Paul Elschot

Performance of MultiSearcher?

2005-03-26 Thread Paul Querna
f the lists, and include the List Names as a keyed field? I suspect most searches would be restricted to one or two lists, but I would like good performance if I wanted to search all of the ASF lists. Ideas/Comments? Anyone willing to help me write some C :) ? Thanks,

Re: Deeply nested boolean query performance

2005-04-01 Thread Paul Elschot
in for some surprises, too. skipTo() has the biggest advantages when the index data is not available in any cache. Regards, Paul Elschot.

Re: Lucene on Linux problem...

2005-04-02 Thread Paul Elschot
to Lucene. Otherwise it would probably be useful to inform the maintainers of the JVM. Regards, Paul Elschot.

Re: EXCEPTION LUCENE 1.4.3 + LUKE

2005-04-04 Thread Paul Elschot
ossible (off the top of my head). I The development version does not have the limit of 32 required/prohibited clauses. The maximum number of clauses is still there. Regards, Paul Elschot > don't know how Luke would get around it either. > > Erik

Re: Strategies for updating indexes.

2005-04-05 Thread Paul Smith
your application too, which is very useful for a single instance, and can be easily broken out to be used in a clustered environment. cheers, Paul Smith

Re: wildcarded phrase queries

2005-04-06 Thread Paul Elschot
gs added above. This normally produces a SpanOrQuery over the added terms and subqueries. Like the boolean clauses, it's advisable to set a maximum to the total number of SpanNearTerms used for a query. Regards, Paul Elschot.

Re: Search performance under high load

2005-04-07 Thread Paul Elschot
ouch with the 'power' users, (via the logs as suggested by Chris) and find out if there are simple measures you can take to help performance for them. For example, replacing a range that is repeatedly used by a cached filter can be quite effective. Regards, Paul Elschot

Re: Updating Index.

2005-04-07 Thread Paul Elschot
can lookup the old one and check whether it should be deleted or not. Updating documents is done by deletion and insertion and this is best done in batches for efficiency. Regards, Paul Elschot

Hungarian notation analyzer and phrase queries

2005-04-12 Thread Paul Smith
Analyzer that has the same problem. Go to searchmorph.com and search for "An instance of HashMap has two parameters" and "An instance of Hash Map has two parameters" I realize that with my custom analyzer I can find it without using a phrase query, but it would be nice. Thanks,

Re: zero boost / zero score

2005-04-13 Thread Paul Elschot
core for the query weights, so this is a feature. I think for a document to be counted in the result of a query should only depend on next(), skipTo() and doc() of the scorer, and not on score(). However, other external code already depends on this code in IndexSearcher, so I'm using anot

Re: Hungarian notation analyzer and phrase queries

2005-04-13 Thread Paul Smith
e areas (like wildcard and fuzzy searches). So it sounds like there isn't a perfect solution, but I think the best tradeoff for me is to put them all in the same position unless anyone has more input on the subject? Paul >>> [EMAIL PROTECTED] 04/13/05 11:36AM >>> :

Re: getting the number of occurrences within a document

2005-04-14 Thread Paul Libbrecht
h good speed... I presume one should be able to find the source of this easily. paul

Reverting QueryParser ?

2005-04-14 Thread Paul Libbrecht
back). Did anyone do this? thanks paul

Re: Fields with same name boosting

2005-04-15 Thread Paul Libbrecht
ty to set the boost for all fields of this name for all documents... Reading the book didn't help me there but I may have overlooked it. paul

Re: token type question

2005-04-16 Thread Paul Libbrecht
r me, is how to match a+(b+1) when the query is X+Y, ie. subtree cut. Does this occur in chemical formulae as well? paul

Passing XML objects to the analyzer ?

2005-04-19 Thread Paul Libbrecht
e-analysis step which converts it into tokens of text which then my analyzer catches again. I'd be more inclined for the first solution but I fear there's a catch. Is there one ? paul

Re: Passing XML objects to the analyzer ?

2005-04-19 Thread Paul Libbrecht
e passed around till the analyzer call which would then decide to accept, say, JDOM objects... paul

Re: Scoring, cosine measure

2005-04-20 Thread Paul Elschot
uted under the hood by counting the indexed terms. Its square root is stored as a single byte value in a special representation with 3 bits mantissa and 5 bits exponent. > Has anyone tried an index based on n-grams? Nutch has bigrams for phrases with frequently occurring words. Regar
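The single-byte layout mentioned in this snippet (3 bits of mantissa, 5 bits of exponent) can be sketched in plain Java. This is only a sketch under stated assumptions, not Lucene's source: the class name `NormByteSketch`, the method names `encode` and `decode`, and the exponent bias of 15 are chosen here for illustration.

```java
public final class NormByteSketch {
    // Assumption for this sketch: 3 mantissa bits, 5 exponent bits, exponent bias 15.
    private static final int MANTISSA_BITS = 3;
    private static final int ZERO_EXPONENT = 15;
    // Lowest float exponent that still fits, shifted into small-float position.
    private static final int FLOOR = (63 - ZERO_EXPONENT) << MANTISSA_BITS;

    /** Compresses a non-negative float into one byte, dropping low mantissa bits. */
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= FLOOR) {
            // Zero and negatives map to 0; tiny positives map to the smallest non-zero code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FLOOR + 0x100) {
            return (byte) -1; // overflow: clamp to the largest representable code
        }
        return (byte) (small - FLOOR);
    }

    /** Expands a byte produced by encode() back to the float it represents. */
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - ZERO_EXPONENT) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (float f : new float[] {1.0f, 0.9f, 0.5f}) {
            byte b = encode(f);
            System.out.println(f + " -> code " + (b & 0xff) + " -> " + decode(b));
        }
    }
}
```

With only three mantissa bits, many distinct inputs share one code: in this sketch 1.0 and 0.5 survive the round trip unchanged, while 0.9 comes back as 0.875.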

Re: Passing XML objects to the analyzer ?

2005-04-20 Thread Paul Libbrecht
. What I could read thus far about position-increment to offer alternatives seems to be limited to one word. paul On 20 Apr 05, at 15:32, Vanlerberghe, Luc wrote: The problem with this approach is that the Analyser you will use for indexing will be *very* different from the one used for searchi

Querying things "close to" a gang of terms ?

2005-04-22 Thread Paul Libbrecht
n the sense of this metric.. As far as I can see, Lucene provides me nothing for this... The LuceneBook only shows an example where all distances are pre-computed. Did I miss something ? Is there another tool ? Maybe infomap ? thanks paul

Re: token type question

2005-04-22 Thread Paul Libbrecht
ng that tackles me is how much this parameter could, again, be something different... In particular, I'd much prefer to have it a tree-path instead of a plain number. I don't have reader plain numbers and they are, often, lost in an XML co

Re: Indexing of virtual "made up" documents

2005-04-26 Thread Paul Libbrecht
on stuff at all. There are some information retrieval settings which tend to say that things that appear early in the document should be considered with greater score... is there nothing like that in Lucene's scoring ? paul

Re: multi word synonym

2005-04-26 Thread Paul Libbrecht
topic. Is there a hope this becomes different in Lucene 1.9 or 2.0 ?? My dream would be to have the position increments living in a tree... you know and... XML tree... thanks paul On 26 Apr 05, at 15:22, Madhu Sasidhar, MD wrote: I have found the previous discussions on multi word synonyms a

Re: Re[2]: multi word synonym (was Hungarian notation analyzer and phrase queries)

2005-04-29 Thread Paul Smith
rver), you don't have a rule to rewrite it to help you find jsp. Paul >>> [EMAIL PROTECTED] 04/27/05 05:51AM >>> Hello, What about the solution to index every multi-word synonym as a single token ? Example : Phrase to index : "i love jsp and tomcat" Synonyms

Re: Re[2]: multi word synonym (was Hungarian notation analyzer and phrase queries)

2005-04-29 Thread Paul Libbrecht
I knew there was a catch... I do think, however, that the point is a delicate one which would deserve consideration: multi-word synonyms are quite common! paul On 29 Apr 05, at 18:47, Paul Smith wrote: Indexing every multi-word synonym as a single token would introduce spaces into the tokens. In that

Re: CVS Lucene 2.0

2005-05-02 Thread Paul Elschot
ly, but it looks like > BooleanScorer1 could be a replacement for both BooleanScorer and > BooleanScorer2. > http://issues.apache.org/bugzilla/show_bug.cgi?id=33019 It's not stated there, but BooleanScorer2 currently de

Re: ArrayIndexOutOfBoundsException on BooleanScorer.score()

2005-05-06 Thread Paul Elschot
on prohibited clauses in the query. Could you indicate which query you are using? And in case you find a way to reproduce this in a test case, could you file a bug report in bugzilla? Regards, Paul Elschot. > > -- m@ > > > > > He

Re: ArrayIndexOutOfBoundsException on BooleanScorer.score()

2005-05-06 Thread Paul Elschot
BooleanQuery, BooleanScorer and QueryParser.jj), but I couldn't find any. Regards, Paul Elschot. > > I'll see about getting a test case, but like I said it doesn't happen > every time so I've had a hard time tracking down the problem. > > -- m@ > > > The

Re: ArrayIndexOutOfBoundsException on BooleanScorer.score()

2005-05-07 Thread Paul Elschot
t). It took some sleep to realize this: This exception can happen when a scorer is add()'ed to the BooleanScorer after the query search has begun. Given that it is difficult to reproduce, the odds are that there are two threads not properly synchronized: one add()ing to the BooleanScorer and on

Re: Search Theory Book

2005-05-12 Thread Paul Libbrecht
How about: http://www.dcs.gla.ac.uk/Keith/Preface.html quite an old one but a recognized one, I think. Also, browse http://www.lt-world.org/ I think. paul On 12 May 05, at 14:04, Pasha Bizhan wrote: Hi, Managing Gigabytes http://www.amazon.com/exec/obidos/tg/detail/-/1558605703/ qid

Re: question about IndexWriter.maxFieldLength

2005-05-17 Thread Paul Elschot
ason for the 10,000 terms limitation is to have an upper bound on the memory used for indexing a single document. Regards, Paul Elschot

Re: NFS

2005-05-18 Thread Paul Libbrecht
share the index among several machines, but, as Otis said, that doesn't seem to be the requirement here. Just a hint that we have experienced using Lucene Indexes on NFS partitions to be much much slower than local partitions... asid

Re: a "real" PhrasePrefixQuery

2005-05-20 Thread Paul Elschot
r all SpanTermQuery's for terms matching tri*. The last one should be a SpanPrefixQuery, but that one is not available. Have a look in PrefixQuery.rewrite() on how to find all terms matching tri*, it's fairly straightforward. Regards, Paul Elschot

Re: multiple index question

2005-05-20 Thread Paul Elschot
this, so you'll have to split your docs over two different Lucene indexes and adapt the search accordingly. Cached filtering helps a lot, but setting up a filter can still be costly. Regards, Paul Elschot

Re: multiple index question

2005-05-20 Thread Paul Elschot
On Friday 20 May 2005 16:21, Robert Newson wrote: > Paul Elschot wrote: > > On Friday 20 May 2005 13:58, Max Pfingsthorn wrote: > > > >>Hi! > >> > >>I was wondering if Lucene has any sort of functionality to distribute > > > > indices so th

Re: a "real" PhrasePrefixQuery

2005-05-20 Thread Paul Elschot
On Friday 20 May 2005 17:20, Terry Steichen wrote: > Paul, > > Could you flesh out the implementation you describe below with some code > or pseudocode? You can start from this: http://issues.apache.org/bugzilla/show_bug.cgi?id=34331 and use code from the method

Re: Question regarding boosting

2005-05-21 Thread Paul Elschot
the query parser and to construct the (nested) boolean query in the program code by adding optional and required clauses to the boolean queries. Regards, Paul Elschot

work on joins ?

2005-05-24 Thread Paul Libbrecht
Hi, from time to time it really looks like it would be useful to be searching for something that has, say, term-x, which is the same as the term matched in another part of the query...i.e. joins. Has there been work done on this in Lucene ? thanks paul

Re: Finding docs which contain at least x of the queryterms

2005-05-25 Thread Paul Elschot
nctionScorer for the minimum number of matchers. The constructor parameter is not used (even in the trunk), so you'll have to write the code to use it yourself. I'd recommend starting from the trunk and extending BooleanQuery for this. Regards, Paul Elschot

Re: search optimization - help

2005-05-25 Thread Paul Elschot
lue of B is not checked and C > is set to true. The value of B is checked only if A is false. > I guess this is not what lucene does as it has to calculate the score > for the document. Am I right? Yes. > If yes, is there any way I can do it using the normal scenario? One would also need

Re: Stemming at Query time

2005-05-31 Thread Paul Libbrecht
can make a phrase-query with possible synonyms for phrase-constituents, you'd need to OR the queries with each set of possible variations (that grows quick! but do you know many people that put large phrase queries?) paul On 30 May 05, at 18:54, Andrew Boyd wrote: Hi All, Now that

Re: Indexing multiple keywords in one field?

2005-05-31 Thread Paul Libbrecht
the book and not in the javadoc and I'd recommend adding it in the javadoc of the add method, it's a non-obvious goodness which suits all forms of scalability! paul

Re: Indexing multiple languages

2005-06-01 Thread Paul Libbrecht
it may be advantageous to search all languages at once. This one may need particular treatment. Tell us your success! paul

Re: Indexing multiple languages

2005-06-03 Thread Paul Libbrecht
include the language (e.g. title_en, title_cn) I quite like 4, because you can search with no language constraint, or with one as Paul suggests below. You can in both cases. In the second, you need to expand the query (ie searching for carrot would search text_en:carrot or text_cn:carrot
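The query expansion described in this snippet (searching for carrot becomes text_en:carrot or text_cn:carrot) can be sketched as plain string building. This is a minimal sketch, assuming per-language fields named `<base>_<lang>` as in the message; the class and method names are hypothetical, and the term is assumed to need no query-syntax escaping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public final class LanguageFieldExpansion {
    /**
     * Builds one field-qualified clause per language and ORs them together.
     * The "_" separator follows the title_en / title_cn naming in the message.
     */
    public static String expand(String baseField, String term, List<String> languages) {
        StringJoiner joiner = new StringJoiner(" OR ");
        for (String lang : languages) {
            joiner.add(baseField + "_" + lang + ":" + term);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // No language constraint: expand over every indexed language.
        System.out.println(expand("text", "carrot", Arrays.asList("en", "cn")));
        // With a language constraint: expand over just that one.
        System.out.println(expand("text", "carrot", Arrays.asList("en")));
    }
}
```

Passing a single language gives the constrained search, and passing all of them gives the unconstrained one, so the same helper covers both cases mentioned in the thread.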

Re: RE REQUEST: SPECIFIC HIT

2005-06-06 Thread Paul Elschot
= +KEYSRC:Digital +KEYSRC:Camera +KEYSRC:Cabel -KEYSRC:(BATTERY > ACCESSORIES CABEL APPERAL) > > The resultant hit would be the 3rd doc instead of 3rd and 5th.. > > > The Problem here is of 2 conditions > > 1) Search could be DIGITAL CAMERA 0PTICS

Re: Relative term frequency?

2005-06-07 Thread Paul Elschot
up Lucene to allow for this? Have a look here: http://issues.apache.org/bugzilla/show_bug.cgi?id=31784 It scores terms by density and it uses a separate table mapping the norms stored in the index to inverse doc lengths. This table could be adapted as needed. When that is not enough, it'

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Paul Elschot
etting the doc numbers for all indexes, then sorting these per index, then retrieving them from all indexes, and repeating the whole thing using terms determined from the retrieved docs. With the indexes on multiple discs, some parallelism can be introduced. A thread per disk could be u

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Paul Elschot
On Tuesday 07 June 2005 09:22, Paul Elschot wrote: ... > > With the indexes on multiple discs, some parallelism can be introduced. > A thread per disk could be used. > In case there are multiple requests pending, they can be serialized just > before the sorting of the terms, and

Re: Documents returned by Scorer

2005-06-07 Thread Paul Elschot
In the development version all scorers implement skipTo. Regards, Paul Elschot

Flushing IndexWriters and IndexReaders

2005-06-07 Thread Paul . Illingworth
version number of the index get updated and could I use this (if I recorded the version of the index that was last optimised) to determine how much activity there had been on an index? Regards Paul I.

Re: use of LinkedList in ConjunctionScorer hurting performance?

2005-06-07 Thread Paul Elschot
ing around here, and it would need a bit of tinkering before posting. Regards, Paul Elschot

Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-08 Thread Paul Elschot
On Wednesday 08 June 2005 01:18, Kevin Burton wrote: > Paul Elschot wrote: > > >For a large number of indexes, it may be necessary to do this over > >multiple indexes by first getting the doc numbers for all indexes, > >then sorting these per index, then retrieving them

Re: Doing a Join across indexes [was Documents returned by Scorer]

2005-06-08 Thread Paul Elschot
On Wednesday 08 June 2005 01:30, Matt Quail wrote: > > On 08/06/2005, at 1:33 AM, Paul Elschot wrote: > > > On Tuesday 07 June 2005 11:42, Matt Quail wrote: > > > >> I've been playing around with a custom Query, and I've just realized > >> tha

Re: OR query on multiple fields causes low coord

2005-06-09 Thread Paul Elschot
ry constructed there, use a BooleanQuery that overrides getSimilarity(). Regards, Paul Elschot

Re: DBSight, search on database by Lucene

2005-06-12 Thread Paul Querna
control over. Fixing it so httpd can cache fixes upstream proxies too, so it is the right thing to do. -Paul

Re: Ideas Needed - Finding Duplicate Documents

2005-06-13 Thread Paul Libbrecht
Have you tried comparing TermVectors ? I would expect them, or an adjustment of them, to allow comparison to focus on "important terms" (e.g. about 100-200 terms) and then allow a more reasonable computation. paul On 12 Jun 05, at 16:37, Dave Kor wrote: Hi, I would like t

Re: Performance with multi index

2005-06-16 Thread Paul . Illingworth
I guess that if you have 10 indexes each with a merge factor of 10 with documents evenly distributed across those indexes then on average there will be a merge every 100 documents. If you have a single index there will be a merge every 10 documents. If you increase your merge factor from 10
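The arithmetic in this snippet can be checked with a small simulation. This is a simplified model rather than Lucene's merge policy: documents are assumed to be dealt round-robin across the indexes, and only the lowest-level merge (one per `mergeFactor` documents received by an index) is counted. The class and method names are hypothetical.

```java
public final class MergeFrequencySketch {
    /**
     * Counts lowest-level merges on the first index when totalDocs documents
     * are distributed round-robin over indexCount indexes.
     */
    public static int mergesOnFirstIndex(int indexCount, int mergeFactor, int totalDocs) {
        int buffered = 0;
        int merges = 0;
        for (int doc = 0; doc < totalDocs; doc++) {
            if (doc % indexCount == 0) { // this document lands on the first index
                buffered++;
                if (buffered == mergeFactor) {
                    merges++;     // a merge is triggered
                    buffered = 0; // and the buffer starts filling again
                }
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        int totalDocs = 1000;
        int split = mergesOnFirstIndex(10, 10, totalDocs);
        int single = mergesOnFirstIndex(1, 10, totalDocs);
        System.out.println("10 indexes: one merge per " + (totalDocs / split) + " documents added");
        System.out.println("1 index:    one merge per " + (totalDocs / single) + " documents added");
    }
}
```

Under these assumptions any one of the ten indexes merges once per 100 documents added overall, while a single index merges once per 10, which matches the figures given in the message.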

Re: Search Hit frequency and location

2005-06-16 Thread Paul Elschot
look at TestPhraseQuery.java in the src/test directory. > Alternatively, any suggestions on what to google, or where to look to > educate myself would be welcome as well. TermQuery and TermScorer make a good starting point. To save some reading, ignore the explain() methods initially. >

About the field of PhraseQuery

2005-06-17 Thread Paul Libbrecht
? paul

Re: About the field of PhraseQuery

2005-06-18 Thread Paul Elschot
On Friday 17 June 2005 22:27, Paul Libbrecht wrote: > hi, > > I spent an hour today to make my field-name feed correctly into my > phrase-query. A ridiculous bug of mine. Debugging experience seemed to > indicate that the field was the field of the first term sent. > >

Lucene scoring bounds ??

2005-06-19 Thread Paul Libbrecht
s but... how can I be sure of this ? thanks paul

Re: About the field of PhraseQuery

2005-06-19 Thread Paul Libbrecht
So why is there no such constructor PhraseQuery(String fieldName) and a method add(Token tok) ?? That would be much more intuitive I feel! paul On 18 Jun 05, at 09:44, Paul Elschot wrote: It will throw an IllegalArgumentException when a Term is added with a different field, which is

Re: About the field of PhraseQuery

2005-06-20 Thread Paul Elschot
On Monday 20 June 2005 08:57, Paul Libbrecht wrote: > So why is there no such constructor > PhraseQuery(String fieldName) > and a method >add(Token tok) > ?? Tradition? > That would be much more intuitive I feel! Regards, Paul Elschot > > paul > >

Re: Span query performance issue

2005-06-25 Thread Paul Elschot
y is slower than PhraseQuery, and I'd expect a factor 3-4 between them. The factor 8 might indicate that there is some room for improvement in the span package. (I'd expect the CellQueue in NearSpans to be the bottleneck.) Regards, Paul Elschot

Re: Index Replication / Clustering

2005-06-26 Thread Paul Smith
hout the main application knowing anything about it. Paul Smith On 26/06/2005, at 2:35 AM, Stephane Bailliez wrote: I have been browsing the archives concerning this particular topic. I'm in the same boat and the customer has clustering requirements. To give some background: I ha

Re: Index Replication / Clustering

2005-06-27 Thread Paul Smith
uest may not give the same result depending on the node it is load-balanced, correct ? In this case we will manually mark the node via Apache worker configs to be disabled until it has caught up. Paul

Re: Index Replication / Clustering

2005-06-27 Thread Paul Smith
pdates is relatively small (as fast as a human can upload things), so we are in a fortunate position in our case. Our guarantees are in the order of minutes rather than seconds. Paul

no EnglishAnalyzer ?

2005-06-28 Thread Paul Libbrecht
anguages can be found in the contribs directory. Any reason I cannot find an "EnglishAnalyzer" and an EnglishStemmer ? I don't think the other analyzers I could find (e.g. StandardAnalyzer) are based on stemmers. thanks paul

Re: no EnglishAnalyzer ?

2005-06-29 Thread Paul Libbrecht
On 29 Jun 05, at 00:57, Erik Hatcher wrote: Paul - if stemming is what you're looking for, then grab the SnowballAnalyzer code from Subversion under contrib/snowball. Or you could get a binary copy of the JAR from the source code distribution of Lucene in Action at

Re: newbie question on Mac OS X

2005-06-29 Thread Paul Libbrecht
Which main class would you expect to run ? I don't think there's one. Lucene is a library. paul PS: this has nothing MacOSX specific On 29 Jun 05, at 10:12, Xing Li wrote: 1) Downloaded 1.4.3 src 2) ran ant... everything builds 3) $ cd builds 4) $ java -jar lucene-1.5-r

Re: lucene query

2005-06-30 Thread Paul Libbrecht
of A and B are equal XML trees. But maybe there's something else in your XML that you wish to retrieve... paul On 30 Jun 05, at 02:54, eshwari pss wrote: Does Lucene support XML searching? - I mean not treating the x

Re: Sentence and Paragraph searching

2005-07-01 Thread Paul Elschot
ing, for example using special characters or > as a separate field in Lucene. After every search, do an extra check to ensure > that Lucene did not match across sentence boundaries. Or try and use a SpanNotQuery to make sure that the sentence or paragraph border is contained in matches i

Re: Sentence and Paragraph searching

2005-07-01 Thread Paul Elschot
extending the index format with index levels: one for normal use, one for sentences, one for paragraphs, ... . Regards, Paul Elschot

Re: UTF-8 indexing and searching

2005-07-01 Thread Paul Libbrecht
... Hope that helps. paul On 1 Jul 05, at 22:41, <[EMAIL PROTECTED]> wrote: Did you check that the request string you get at the analyzer level is correctly encoded as UTF-8? We had the same problem with French accented chars also encoded as UTF-8, and transmitted by Tomcat as ISO-885

Re: Unexpected: ordered

2005-07-03 Thread Paul Elschot
the document to try and get to the original text that causes this exception, and use that to file a bug report? Regards, Paul Elschot

Re: Most Useful Lucene Taglib?

2005-07-05 Thread Paul Libbrecht
On 5 Jul 05, at 03:45, Chris Fellows wrote: Is there a strong web client user base of Lucene? I would estimate this to be at least 50% of, say, the java-user@lucene.apache.org mailing-list, really a personal guess, though. paul

Re: Boosting SpanQueries

2005-07-07 Thread Paul Libbrecht
Enclosing it in a boolean-query where it's alone and which, itself, has a boosting would seem to work for me... paul On 7 Jul 05, at 11:04, Vincent Le Maout wrote: a way to implement something as boosting allowing to enhance the score of documents containing a particular word of a span

Re: Search Timeout - abort a search

2005-07-07 Thread Paul Elschot
collect method until it finished. > > Is there something else I'm missing? To stop the search from a HitCollector just throw an IOException or an Error and catch it where the search was started. Since most searching in Lucene already throws IOException you might try and use a subclass of I
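The approach suggested in this reply, throwing from the collector and catching where the search was started, can be sketched without any Lucene classes. Everything here is a hypothetical stand-in: `Collector` plays the role of the hit collector, `search` is a fake search loop, and `SearchAbortedException` is an illustrative IOException subclass. A hit budget is used instead of a wall-clock deadline so the example is deterministic; a time check would go in the same place.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public final class SearchAbortSketch {
    /** Hypothetical stand-in for a hit collector callback. */
    public interface Collector {
        void collect(int doc, float score) throws IOException;
    }

    /** Illustrative IOException subclass, so the abort can be told apart from real I/O errors. */
    public static final class SearchAbortedException extends IOException {
        public SearchAbortedException(String message) {
            super(message);
        }
    }

    /** Fake search loop: reports every document up to maxDoc as a hit. */
    static void search(int maxDoc, Collector collector) throws IOException {
        for (int doc = 0; doc < maxDoc; doc++) {
            collector.collect(doc, 1.0f);
        }
    }

    /** Runs the fake search, aborting once 'limit' hits have been collected. */
    public static int collectUntil(int maxDoc, int limit) {
        final int[] collected = {0};
        try {
            search(maxDoc, (doc, score) -> {
                if (collected[0] >= limit) {
                    // Abort from inside the collector; a deadline check could go here instead.
                    throw new SearchAbortedException("hit budget exhausted");
                }
                collected[0]++;
            });
        } catch (SearchAbortedException e) {
            // Expected: the search was stopped on purpose, keep the partial result.
        } catch (IOException e) {
            throw new UncheckedIOException(e); // a genuine I/O failure
        }
        return collected[0];
    }

    public static void main(String[] args) {
        System.out.println("collected " + collectUntil(1000, 25) + " hits before aborting");
    }
}
```

Using a dedicated subclass means the catch site can keep the partial results for a deliberate abort while still propagating unrelated I/O errors.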

Index Partitioning ( was Re: Search deadlocking under load)

2005-07-08 Thread Paul Smith
omatically closed? Appreciate any thoughts on this. I'd rather know now while I have the opportunity to change the design than later when in production.. :) cheers, Paul Smith On 09/07/2005, at 5:39 AM, Otis Gospodnetic wrote: Nathan, 3) is the recommended usage. Your index is on an

Re: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-10 Thread Paul Smith
ut if your code looks like this... Searcher s = new IndexSearcher(IndexReader.open(foo)) ...then you are screwed, because nothing will ever close that reader and free its resources. That was my initial thought when Nathan outlined his issue. I've seen that happen before myself. Paul
