Re: Intel I7 benchmark request.

2010-01-20 Thread Yonik Seeley
On Wed, Jan 20, 2010 at 11:21 AM, Dawid Weiss wrote: > Is there anyone with access to an Intel I7-machine? I'd be curious > what the results of this benchmark are, given the new JVM intrinsics > introduced in HotSpot 1.7: FYI, the AMD Phenom also has the POPCNT instruction. -Yonik http://www.luc
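
For readers unfamiliar with the intrinsic under discussion: the benchmark itself is not shown in this snippet, so the following is only a minimal sketch of the kind of loop that can benefit. `Long.bitCount` is a standard JDK method; whether the JIT actually compiles it to a POPCNT instruction depends on the JVM version, its flags and the CPU. The class and method names are made up for illustration.

```java
final class BitCountSketch {
    /** Counts the set bits across all words of a bit set backed by a long[]. */
    static long cardinality(long[] words) {
        long total = 0;
        for (long word : words) {
            // Long.bitCount is the call a JIT may replace with a single POPCNT.
            total += Long.bitCount(word);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] words = {0b1011L, -1L, 0L};
        System.out.println(cardinality(words));
    }
}
```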

Re: Nasty NIO behavior makes NIOFSDirectory silently close channel

2010-01-28 Thread Yonik Seeley
On Thu, Jan 28, 2010 at 8:24 AM, Michael McCandless wrote: > Bummer. > > So the only viable workarounds are 1) don't use Thread.interrupt (nor, > things like Future.cancel, which in turn use Thread.interrupt) with > NIOFSDir, or 2) we fix NIOFSDir to reopen the channel AND the app must > make a de
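
The behavior behind this thread can be reproduced with plain JDK classes. The sketch below is not NIOFSDirectory code; it only assumes a temporary file and a positional `FileChannel.read`, and shows that a read attempted while the thread's interrupt flag is set fails with `ClosedByInterruptException` and leaves the channel closed for every other user of it. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InterruptClosesChannelSketch {
    /** Returns true if a read with the interrupt flag set left the channel closed. */
    static boolean channelClosedByInterrupt() {
        try {
            Path file = Files.createTempFile("nio-interrupt", ".bin");
            try {
                Files.write(file, new byte[] {1, 2, 3, 4});
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    // Stand-in for Thread.interrupt() arriving from elsewhere,
                    // e.g. via Future.cancel(true).
                    Thread.currentThread().interrupt();
                    try {
                        channel.read(ByteBuffer.allocate(4), 0);
                        return false; // the read was not interrupted
                    } catch (ClosedByInterruptException expected) {
                        return !channel.isOpen();
                    } finally {
                        Thread.interrupted(); // clear the flag so cleanup is unaffected
                    }
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("channel closed by interrupt: " + channelClosedByInterrupt());
    }
}
```

Because the channel is shared, one interrupted thread is enough to break reads for all the others, which is why the workarounds quoted above either avoid interrupts or reopen the channel.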

Re: Nasty NIO behavior makes NIOFSDirectory silently close channel

2010-01-28 Thread Yonik Seeley
On Thu, Jan 28, 2010 at 3:49 PM, Grant Ingersoll wrote: > Could we get the Channel (and other necessary classes) implementation from > Apache Harmony Public JDK methods don't have the low level stuff to implement our own, so rolling our own or using Harmony would require native code... not a goo

Re: FieldCacheImpl concurrency

2010-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2010 at 9:54 AM, Shay Banon wrote: >    I would like to try and improve concurrency in Lucene in several places, > and thought I would start with FieldCacheImpl. The implementation is heavily > synchronized on both a global map and on creation values for a pretty > heavily used pat

Re: FieldCacheImpl concurrency

2010-02-12 Thread Yonik Seeley
On Fri, Feb 12, 2010 at 1:50 AM, Shay Banon wrote: > On Thu, Feb 11, 2010 at 5:41 PM, Yonik Seeley >> It really shouldn't be heavily used. >> For a sorted search, get() is called once per segment in the index. >> There is no synchronization to retrieve per-document valu

Re: Uwe's question

2010-02-26 Thread Yonik Seeley
On Fri, Feb 26, 2010 at 3:31 PM, Jason Rutherglen wrote: >> I've never tried to learn a command-line invocation of a test >> case for a single test method, I've always just used the IDE >> to run individual methods > > Right, I've been doing bunches of Solr dev which for me only works > from t

Re: Committer permissions

2010-03-14 Thread Yonik Seeley
I've merged JIRA permissions - let me know if I accidentally left anyone out. Looking forward to the communities working more closely again! (go Robert, go! ;-) -Yonik On Sun, Mar 14, 2010 at 10:44 AM, Grant Ingersoll wrote: > Per the vote on general@ to merge committers, I've given Lucene and

Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-14 Thread Yonik Seeley
On Sun, Mar 14, 2010 at 5:47 PM, Mark Miller wrote: > On 03/14/2010 06:37 PM, Grant Ingersoll wrote: >> >> On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote: >> >> >>> >>> This time a +1 without discuss :-) >>> >> >> Yeah, but Uwe, the thread was DISCUSS, not VOTE!  :-) >> > > I had a whole spiel a

lucene and solr trunk

2010-03-15 Thread Yonik Seeley
Due to a tremendous amount of work by our newly merged committer corps, the get-on-lucene-trunk branch (branches/solr) is ready for prime-time as the new solr trunk! Lucene and Solr need to move to a common trunk for a host of reasons, including single patches that can cover both, shared tags and

Re: lucene and solr trunk

2010-03-16 Thread Yonik Seeley
On Tue, Mar 16, 2010 at 2:51 AM, Michael Busch wrote: > Also, we're in review-and-commit process, not commit-and-review.  Changes > have to be > proposed, discussed and ideally attached to jira as patches first. Correction, just for the sake of avoiding future confusion (i.e. I'm not making any

Re: lucene and solr trunk

2010-03-16 Thread Yonik Seeley
On Tue, Mar 16, 2010 at 5:42 AM, Michael McCandless wrote: > I think it like the 1st option best (lucene moves as subdir to solr's > current trunk SVN path), but I don't feel strongly. > > This'd mean one could simply checkout lucene alone and do everything > you can do today. > > But if you check

Re: #lucene IRC log [was: RE: lucene and solr trunk]

2010-03-16 Thread Yonik Seeley
IRC has been discussed to death at Apache: http://markmail.org/search/?q=IRC+list%3Aorg.apache.incubator.general Look for the spikes... like this: http://markmail.org/search/?q=IRC+list%3Aorg.apache.incubator.general#query:IRC%20list%3Aorg.apache.incubator.general%20date%3A200608%20+page:1+state:

Re: lucene and solr trunk

2010-03-16 Thread Yonik Seeley
On Tue, Mar 16, 2010 at 5:42 PM, Jake Mannix wrote: > On Tue, Mar 16, 2010 at 2:31 PM, Michael McCandless > wrote: >> >> If we move lucene under Solr's existing svn path, ie: >> >>  /solr/trunk/lucene > > Chiming in just a bit here - isn't there any concern that independent of > whether or not pe

Re: lucene and solr trunk

2010-03-16 Thread Yonik Seeley
On Tue, Mar 16, 2010 at 6:01 PM, Jake Mannix wrote: > I'm not concerned with casual downloaders.  I'm talking about the companies > and people who > may or may not be interested in making multi-million dollar decisions > regarding using or > not using Lucene or Solr. Heh - multi-million dollar de

Re: svn commit: r924731 - in /lucene/java/trunk/contrib/analyzers/common: build.xml src/test/org/apache/lucene/analysis/snowball/ src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java

2010-03-18 Thread Yonik Seeley
E, let's strive for slightly better commit messages ;-) -Yonik On Thu, Mar 18, 2010 at 7:48 AM, wrote: > Author: uschindler > Date: Thu Mar 18 11:48:11 2010 > New Revision: 924731 > > URL: http://svn.apache.org/viewvc?rev=924731&view=rev > Log: > LUCENE-2326: As rmuir seems to bug me about t

Re: Mailing List merge

2010-03-22 Thread Yonik Seeley
wait, wait... no... [email protected] -Yonik

Re: Branding Solr+Lucene

2010-03-22 Thread Yonik Seeley
On Mon, Mar 22, 2010 at 2:20 PM, Ryan McKinley wrote: > I'm confused... what is the need for a new name?  The only place where > there is a conflict is in the top level svn tree... Agree, no need to re-brand. > What about something general like: > https://svn.apache.org/repos/asf/lucene/dev > or

Re: Branding Solr+Lucene

2010-03-22 Thread Yonik Seeley
On Mon, Mar 22, 2010 at 2:25 PM, Shai Erera wrote: > To the best of my knowledge, it > hasn't been decided that Lucene and Solr merge and become ONE thing Depends on what the meaning of "thing" is ;-) We have merged into one development project. But Lucene and Solr as separate downloads will re

Re: New LuSolr trunk (was: RE: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling)

2010-03-23 Thread Yonik Seeley
For Solr, we can just move the current trunk to a 1.5 branch. -Yonik On Tue, Mar 23, 2010 at 9:39 AM, Grant Ingersoll wrote: > > On Mar 22, 2010, at 8:27 AM, Uwe Schindler wrote: > >> Hi all, >> >> the discussion where to do the development after the merge, now gets actual: >> >> Currently a luso

Re: Merge Status

2010-03-23 Thread Yonik Seeley
If you have checkouts of the previous trunks that you don't want to re-checkout, then use svn switch. Solr trunk was moved to a 1.5 branch, so for old trunk checkouts, cd into your directory and do svn switch https://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev For "newtrunk" chec

Re: Merge Status

2010-03-23 Thread Yonik Seeley
On Tue, Mar 23, 2010 at 10:49 AM, Grant Ingersoll wrote: > > On Mar 23, 2010, at 10:09 AM, Grant Ingersoll wrote: > >> >> 3. Other nightly build stuff.  My cron tabs, etc.  I will update them to >> point at the new trunk. > > OK, I updated my cron tab for the site check out of Lucene.  Not sure w

Re: Running the Solr/Lucene tests failed

2010-03-23 Thread Yonik Seeley
On Tue, Mar 23, 2010 at 5:07 PM, Michael Busch wrote: > OK I reran the tests sequentially with my LUCENE-2329 patch applied.  The > same test failed again: > > [junit] Test org.apache.solr.client.solrj.embedded.JettyWebappTest FAILED > > > Everything else looks good.  So it should be ok to commit

Re: svn commit: r929520 - in /lucene/dev/trunk/lucene/contrib/benchmark: ./ src/java/org/apache/lucene/benchmark/byTask/utils/ src/test/org/apache/lucene/benchmark/byTask/utils/

2010-03-31 Thread Yonik Seeley
On Wed, Mar 31, 2010 at 9:06 AM, wrote: > JIRA-2353: Config incorrectly handles Windows absolute pathnames The JIRA-2353 part is my fault - Shai asked me the right format for JIRA linking, and I got it wrong! Should be LUCENE-2353 of course. need...more...coffee... -Yonik

Re: Proposal about Version API "relaxation"

2010-04-14 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 10:39 AM, DM Smith wrote: > Maybe have the index store the version(s) and use that when constructing a > reader or writer? That would cause a reindex to change behavior (among other problems). -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague

Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 11:06 AM, Chris Male wrote: > While having fewer boxes means fewer term queries to make against the index, > more documents means more costly calculations to filter out those extraneous > documents. Filtering out documents (greater selectivity) seems like it should be the

Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 12:12 PM, Chris Male wrote: > On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll >> On Apr 14, 2010, at 11:06 AM, Chris Male wrote: >> > For those doing just Cartesian Tier filtering it seems like the new >> > approach is a win, but for those doing distance calculations on t

Re: Proposal about Version API "relaxation"

2010-04-15 Thread Yonik Seeley
Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot wrote: > I like the idea of index conversion tool over silent online upgrade > beca

Re: Proposal about Version API "relaxation"

2010-04-15 Thread Yonik Seeley
On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot wrote: > On Thu, Apr 15, 2010 at 17:17, Yonik Seeley > wrote: >> Seamless online upgrades have their place too... say you are upgrading >> one server at a time in a cluster. > > Nothing here that can't be solved w

Re: Proposal about Version API "relaxation"

2010-04-15 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 5:22 PM, Michael McCandless wrote: >  * There is no back compat across major releases (index nor APIs), >    but full back compat within branches. > > This would match how many other projects work (KS/Lucy, as Marvin > describes above; Apache Tomcat; Hibernate; log4J; FreeB

Re: subclasses of abstract Query class are not implementing all methods

2005-03-10 Thread Yonik Seeley
Great idea. I am also planning on doing some caching based on query objects, and I hadn't realized that these weren't implemented. In the meantime, I guess I can cache an Object that contains a Query, and traverse the query tree myself to implement missing .hashCode() or .equals() functionality.

Re: subclasses of abstract Query class are not implementing all methods

2005-03-11 Thread Yonik Seeley
> Having a clean implementation of toString(), one would be able to > reparse a Query with QueryParser. This is something which was discussed > several times on the dev and user lists. I expect that it will work for > all queries supported by QueryParser. Only under certain restrictions. Remember

scalability w/ number of fields

2005-04-04 Thread Yonik Seeley
I know Lucene is very scalable in many ways, but how about number of fieldnames? We have an index using around 6000 unique fieldnames, 450,000 documents, and a total index size of 4GB. It's very sparse... documents don't have that many fields, but the number of different fieldtypes is huge. An

Re: scalability w/ number of fields

2005-04-05 Thread Yonik Seeley
=1.4GB optimize_time=4min It's a little apples-to-oranges since we simply removed some of the fields to test a lower field count (and hence the index size also goes down). -Yonik On Apr 4, 2005 5:35 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > I know Lucene is very scalable in

BooleanQuery.equals() change

2005-04-11 Thread Yonik Seeley
Erik, why was the last change to BooleanQuery made? The comment was "Correct BooleanQuery.equals such that every clause is compared". It looks like Vector.equals() should have worked, and the new code is probably slower as it creates two new arrays. We actually need to do some caching with Query
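
As a small illustration of the point about `Vector.equals()`: it compares two vectors element by element, in order, using each element's own `equals`. The sketch below uses strings as stand-ins for clauses (an assumption; real clauses would need their own `equals` for this to work), and it does not show why the change in question was made, since that part of the thread is not included here.

```java
import java.util.Vector;

final class VectorEqualsSketch {
    /** True if both vectors hold equal elements in the same order. */
    static boolean sameClauses(Vector<String> a, Vector<String> b) {
        return a.equals(b); // element-wise comparison, no temporary arrays needed
    }

    public static void main(String[] args) {
        Vector<String> a = new Vector<>();
        a.add("+title:lucene");
        a.add("-body:spam");
        Vector<String> b = new Vector<>(a);
        System.out.println(sameClauses(a, b));
    }
}
```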

Re: BooleanQuery.equals() change

2005-04-12 Thread Yonik Seeley
> On Apr 11, 2005, at 5:57 PM, Yonik Seeley wrote: > > Erik, why was the last change to BooleanQuery made? > > The comment was "Correct BooleanQuery.equals such that every clause is > > compared". > > > > It looks like Vector.equals() should have worked,

UnscoredRangeQuery

2005-04-14 Thread Yonik Seeley
OK, so I implemented an UnscoredRangeQuery we needed for use with lucene 1.4.3. Seems to work fine for me, so I thought I would put it out here to see what you guys think... (files attached) Would a cleaned up version be useful for some version of Lucene, or will all the current work that Paul is

Re: UnscoredRangeQuery

2005-04-15 Thread Yonik Seeley
Paul, your response reminded me of a case I forgot to consider: multi-valued fields (the same document matching more than one term in the range). In that case, my scorer will produce duplicates. I assume that a scorer that produces dups is not allowed. It looks like I don't really have any other

Re: UnscoredRangeQuery

2005-04-21 Thread Yonik Seeley
OK, so as I said, my previous version of UnscoredRangeQuery that could work with any number of terms in the range had a problem - it could return duplicates if a doc had more than one term in the range. Here is how I fixed it: I hacked together an UnscoredQuery that takes a Filter (it's like Filte
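
The duplicate problem and the filter-style fix can be sketched without any Lucene classes. Assume each term in the range contributes a list of document ids (the `postingsPerTerm` parameter below is a hypothetical stand-in for enumerating term docs); collecting them into a bit set first means a document that matches several terms is still reported only once. This is an illustration of the idea, not the attached code.

```java
import java.util.BitSet;

final class RangeDocSetSketch {
    /** Unions the postings of every term in the range into one set of doc ids. */
    static BitSet collect(int maxDoc, int[][] postingsPerTerm) {
        BitSet docs = new BitSet(maxDoc);
        for (int[] postings : postingsPerTerm) {
            for (int doc : postings) {
                docs.set(doc); // setting an already-set bit is a no-op, so no duplicates
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        // Doc 4 has two values inside the range, so it appears under two terms.
        BitSet docs = collect(10, new int[][] {{1, 4}, {4, 7}});
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            System.out.println(doc);
        }
    }
}
```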

Re: UnscoredRangeQuery

2005-04-26 Thread Yonik Seeley
> ConstantScoreQuery would seem better, with the addition of the constant > score value as a constructor argument. OK, I changed the names to ConstantScoreQuery and ConstantScoreRangeQuery. Should I add a constantScore field to the class, or just rely on the boost (and have the user call setBoos

Re: build process changes

2005-05-05 Thread Yonik Seeley
Definitely. That's exactly our usecase... copying all the jars we need (or perhaps all of them) into the lib directory of our webapp. -Yonik On 5/5/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > I think that all of the jars we create should be prefixed with 'lucene-' > and end with the version.

Re: FieldCache parser

2005-05-10 Thread Yonik Seeley
This does solve one problem I was having. There are still a few issues I need to solve: - double and long support? - Sorting support for multiple indexed fields mapped onto a single field using fieldname=fieldvalue. For example, when field "x" is specified, I actually need just a slice

Re: constant scoring queries

2005-05-10 Thread Yonik Seeley
Hey now... you're going to obsolete all my in-house code and put me out of a job ;-) Could you elaborate on the advantage of having say a TermQuery that could be either normal-scoring or constant-scoring vs two different Query classes for doing this? They seem roughly equivalent. > 1. Add two m

Re: optimized reopen?

2005-05-11 Thread Yonik Seeley
Things are cached using an IndexReader as the key, so you would have to be careful not to break the current behaviour (that an IndexReader's view of an index doesn't change - deletes from that specific reader aside). Maybe you could invoke reopen() on an existing IndexReader and it would return a
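
A rough sketch of what "cached using an IndexReader as the key" implies, under the assumption that the cache is a weak map from reader to a computed value (the generic type `R` stands in for the reader; nothing here is Lucene's actual cache code). The key point is that a cached value is only valid as long as the view behind its key never changes, which is why a reopen would have to hand back a different reader object.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

final class PerReaderCacheSketch<R, V> {
    // Weak keys let an entry disappear once its reader is no longer referenced.
    private final Map<R, V> cache = new WeakHashMap<>();
    private final Function<R, V> loader;

    PerReaderCacheSketch(Function<R, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value for this reader, computing it on first use. */
    synchronized V get(R reader) {
        return cache.computeIfAbsent(reader, loader);
    }
}
```

With this shape, a reopened reader that is a new object simply misses the cache and gets fresh values, while the old reader keeps its unchanged view.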

Re: [Performance]: IndexWriter again...

2005-05-16 Thread Yonik Seeley
I like the idea Paul. As far as how it should be implemented, perhaps a count of docs in memory should be kept. It doesn't seem necessary to traverse all of the segments on every add (it's a linear operation, and will only result in a merge every "minMergeDocs" or "maxBufferedDocs"). -Yonik On

Re: optimized reopen?

2005-05-16 Thread Yonik Seeley
Oops, this fell off lucene-dev... re-adding it. -Yonik On 5/16/05, Yonik Seeley <[EMAIL PROTECTED]> wrote: > In general, people shouldn't be relying on GC to clean up non-memory > resources. If you don't explicitly call close() on an IndexReader > when you are done with

Re: constant scoring queries

2005-05-17 Thread Yonik Seeley
> To have the actual implementation of java.util.BitSet > in the interface is not really nice. Totally agree. > The FilteredQuery here: > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965 > has two constructors, one for a Filter that provides a BitSet, > and one for SkipFilter that provides

Re: constant scoring queries

2005-05-18 Thread Yonik Seeley
> > > contains(docid) and exists(docid) cannot be efficiently implemented > > > on a VInt based compact filter, so I'd prefer to leave these out. > > > > exists() on a BitSet is much faster than next() though... > > Yes, but the point is to iterate to the next document based in information > from

Re: major searching performance improvement

2005-05-25 Thread Yonik Seeley
Looks like really great stuff Robert! > 2. I agree that creating NioFSDirectory rather than modifying FSDirectory. I > originally felt the memory mapped files would be the fastest, but it also > requires OS calls, the "caching" code is CONSIDERABLY faster, since it does > not need to do any JNI, o

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
The temporary char[] buffer is cached per InputStream instance, so the extra memory allocation shouldn't be a big deal. One could also use String(byte[],offset,len,"UTF-8"), and that creates a char[] that is used directly by the string instead of being copied. It remains to be seen how fast the
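
For reference, the constructor mentioned above looks like this in use. The sketch uses the `Charset` overload from `java.nio.charset.StandardCharsets` instead of the charset-name string, and it decodes standard UTF-8; as the thread title says, the bytes Lucene wrote at the time were not strictly standard UTF-8, so treat this as an illustration of the API only. Whether the JDK reuses or copies the decoded char[] internally is an implementation detail not shown here.

```java
import java.nio.charset.StandardCharsets;

final class Utf8SliceSketch {
    /** Decodes length bytes starting at offset as UTF-8. */
    static String decode(byte[] buffer, int offset, int length) {
        return new String(buffer, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buffer = "..h\u00e9llo..".getBytes(StandardCharsets.UTF_8);
        // "h\u00e9llo" occupies 6 bytes because U+00E9 takes two bytes in UTF-8.
        System.out.println(decode(buffer, 2, 6));
    }
}
```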

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> How will the difference impact String memory allocations? Looking at the > String code, I can't see where it would make an impact. This is from Lucene InputStream: public final String readString() throws IOException { int length = readVInt(); if (chars == null || length > chars.length) chars =

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> Sure you can. Do a "tell" to get the position. Write any number. The representation of the number is variable sized... you can't use a placeholder. -Yonik Now hiring -- http://tinyurl.com/7m67g
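
To make the "variable sized" point concrete, here is a minimal sketch of a VInt-style encoder: seven payload bits per byte, low-order bits first, with the high bit set on every byte except the last. The class name is made up; the encoding follows the VInt description in the Lucene file format documentation. Since the byte count depends on the value, a placeholder written before the real length is known could be the wrong size.

```java
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    /** Encodes a non-negative int using 7 bits per byte, low-order group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int v : new int[] {127, 128, 16383, 16384}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```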

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> I think you guys are WAY overcomplicating things, or you just don't know > enough about the Java class libraries. People were just pointing out that if the vint isn't String.length(), then one has to either buffer the entire string, or pre-scan it. It's a valid point, and CharsetEncoder doesn

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
On 8/30/05, Robert Engels <[EMAIL PROTECTED]> wrote: > > Not true. You do not need to pre-scan it. What I previously wrote, with emphasis on key words added: "one has to *either* buffer the entire string, *or* pre-scan it." -Yonik Now hiring -- http://tinyurl.com/7m67g

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. -Yonik Now hiring -- http://tinyurl.com/7m67g On

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> The inefficiency would be if prefix were re-converted from UTF-8 > for each term, e.g., in order to compare it to the target. Ahhh, gotcha. A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right?

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For sorting queries, one is able to specify a Locale. -

time for "ant test"?

2005-09-20 Thread Yonik Seeley
I'm using Lucene 1.9 for the first time (migrating from Lucene 1.4.3) and building from the latest version in SVN. How long should "ant test" take? It seems to progress normally, then freezes after TestWildcard (have been waiting like 20 minutes). The CPU is still spinning doing something... ja

Re: Lucene and UTF-8

2005-09-21 Thread Yonik Seeley
How does this patch work w.r.t. the length vint? It looks like the length is still the number of 16 bit java chars, but the encoding is now correct UTF-8? -Yonik Now hiring -- http://tinyurl.com/7m67g On 9/21/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > On Sep 20, 2005, at 11:53 PM, Chris
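
The question hinges on three different ways to measure the same string, which only agree for plain ASCII. The sketch below (class name made up) shows them side by side for a string containing one character outside the Basic Multilingual Plane; it says nothing about which unit the patch actually writes.

```java
import java.nio.charset.StandardCharsets;

final class LengthUnitsSketch {
    /** Number of 16-bit Java chars (UTF-16 code units). */
    static int javaChars(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes in standard UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1E"; // 'a' followed by U+1D11E, stored as a surrogate pair
        System.out.println(javaChars(s) + " chars, " + codePoints(s)
                + " code points, " + utf8Bytes(s) + " UTF-8 bytes");
    }
}
```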

incorrect score normalization in hits

2005-09-21 Thread Yonik Seeley
Hits does normalization based on the score of the first document "scoreDocs[0].score" This is a problem if sort is on anything other than score, since the first document won't necessarily be the highest scoring. I propose fixing this by adding a field to TopDocs called "maxScore", and using that i
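
The proposal can be sketched in a few lines. The code below is not taken from Hits; it assumes the rule "scale only when the maximum exceeds 1.0" and uses a made-up class name. It shows why the first document is the wrong reference once results are sorted by something other than score: the maximum has to be tracked separately.

```java
final class ScoreNormSketch {
    /** Scales scores so the highest is at most 1.0, regardless of result order. */
    static float[] normalize(float[] scores) {
        float maxScore = 0f;
        for (float s : scores) {
            maxScore = Math.max(maxScore, s); // not scores[0]: order may be by another field
        }
        float norm = maxScore > 1.0f ? 1.0f / maxScore : 1.0f;
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Sorted by some field, so the top-scoring doc (4.0) is not first.
        float[] normalized = normalize(new float[] {0.5f, 4.0f, 2.0f});
        for (float f : normalized) {
            System.out.println(f);
        }
    }
}
```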

Re: UInt32 or Int32

2005-09-22 Thread Yonik Seeley
I'd lean toward keeping UInt32 in general, so at least that will scale to 4B documents. SegSize is the only place where UInt32 is used that it will matter (all of the other uses will never approach that size). writeInt() writes both signed and unsigned integers (or rather the bit pattern could be
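
On the point that the same four bytes can carry either interpretation: Java has no unsigned 32-bit type, but a value read as a signed `int` can be widened to its unsigned meaning with a mask. This is a generic sketch with a made-up class name, not code from Lucene.

```java
final class UInt32Sketch {
    /** Reinterprets the 32-bit pattern of a signed int as an unsigned value. */
    static long toUnsigned(int bits) {
        return bits & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int stored = (int) 3_000_000_000L; // same bit pattern, negative as a signed int
        System.out.println(stored + " as unsigned is " + toUnsigned(stored));
    }
}
```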

Re: TokenFilters eating position increments

2005-09-22 Thread Yonik Seeley
> Thoughts? LOL! You're psychic. http://issues.apache.org/jira/browse/LUCENE-438 -Yonik Now hiring -- http://tinyurl.com/7m67g On 9/22/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > > Yonik identified an interesting issue with LUCENE-437 - http:// > issues.apache.org/jira/browse/LUCENE-437

Re: incorrect score normalization in hits

2005-09-23 Thread Yonik Seeley
Never mind... my mistake. FieldSortedHitQueue takes care of tracking maxscore and normalizes the score in fillFields(). -Yonik On 9/21/05, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > Hits does normalization based on the score of the first document > "scoreDocs[0].score"

score based on field value

2005-09-23 Thread Yonik Seeley
There has been a lot of interest in generating a score or boosting based on the value of a particular field. Here is my first prototype that can handle int and float field values. I'm not particularly happy with the form of this solution yet, which is why I'm throwing it out to dev to see if anyone

constant score queries / function scores and idf

2005-09-26 Thread Yonik Seeley
For something like ConstantScoreRangeQuery, or a query that scores based on field contents, it doesn't make sense to have an idf() (and hence I leave it out). However, I'm worried about how these constant scores will vary with respect to other clauses when the index changes size. Does it make sens

Re: score based on field value

2005-09-26 Thread Yonik Seeley
Here is another prototype. Field value retrieval has been decoupled from function calculation, so one will be able to use a field of any type with the same function class. Comments? If not, I think this may be the version I clean up, implement sources for Integer and Ordinal fields, and submit to

Re: Eliminating norms ... completley

2005-10-07 Thread Yonik Seeley
I'm approaching it the same as term vectors... make the norms optional on a per field basis. My first cut is returning a dummy norm array filled in with the equiv of 1.0f so I didn't have to go modify all the queries/weights/scorers that retrieve the norms. For best performance, those scorers shoul

Re: Eliminating norms ... completley

2005-10-08 Thread Yonik Seeley
ward compatible with custom queries and scorers that we don't have access to. -Yonik Now hiring -- http://tinyurl.com/7m67g On 10/7/05, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > I'm approaching it the same as term vectors... make the norms optional on > a per field ba

Re: Document-at-a-time or Term-at-a-time searching

2005-10-10 Thread Yonik Seeley
Lucene does "document-at-a-time", as you describe. One could make a query to do term-at-a-time in lucene if desired, as the traversal logic is encapsulated in the particular scorers for queries. > I'm looking into how easy it would be to implement a streaming search in Lucene > (where a document is

Re: how to search lucene with "great than" query

2005-10-10 Thread Yonik Seeley
Use an open-ended range query by leaving one endpoint null of a RangeQuery or RangeFilter. The current QueryParser API doesn't have syntax to support this so you have to use the Java API. -Yonik Now hiring -- http://tinyurl.com/7m67g On 10/10/05, haipeng du <[EMAIL PROTECTED]> wrote: > > I want

Re: Are Non-consecutive Document IDs feasible?

2005-10-11 Thread Yonik Seeley
Yes, lucene depends on consecutive docids. For the query side, the following things come to mind. - for sorting, the FieldCache allocates arrays up to maxDoc() - for deleted documents, it's a BitVector up to maxDoc() - Some queries like MatchAllDocumentsQuery do a linear scan through deleted docu

Re: Document Duplication for Multiple Segment Merge

2005-10-14 Thread Yonik Seeley
There is no concept in Lucene of document identity linked to any fields of a document. You need to handle removal of duplicates yourself. -Yonik Now hiring -- http://tinyurl.com/7m67g On 10/14/05, Michael Ji <[EMAIL PROTECTED]> wrote: > > hi, > > When Nutch's IndexMerger.java is called, the inde

Re: Document Duplication for Multiple Segment Merge

2005-10-14 Thread Yonik Seeley
Sorry, I've only briefly looked at Nutch, so you should ask on that mailing list. Lucene doesn't do deduping. -Yonik Now hiring -- http://tinyurl.com/7m67g On 10/14/05, Michael Ji <[EMAIL PROTECTED]> wrote: > > hi Yonik: > > Does that mean when two documents has same MD5 content > in two differe

Re: Welcome Yonik Seeley as committer!

2005-10-25 Thread Yonik Seeley
Thanks everyone, I'm glad to be on board! -Yonik On 10/24/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > > Last week I proposed to the Lucene PMC that we make Yonik Seeley a > committer on Lucene Java. I am pleased to announce that other PMC > members agreed. Welcome, Yonik! > > Doug >

score normalization

2005-10-27 Thread Yonik Seeley
A couple of things about score normalization... a) search(weight,filter,ndocs) does not do normalization b) search(weight,filter,ndocs,sort) does do normalization... even if the sort is by score only c) because of (b), the normalized scores given by MultiSearcher.search (weight,filter,ndocs,sort) m

updating lucene site

2005-10-27 Thread Yonik Seeley
How does one update the web site? I updated xdocs/fileformats.xml, ran "ant docs", then committed both docs/fileformats.html and xdocs/fileformats.xml Are there other steps before changes will appear? -Yonik Now hiring -- http://forms.cnet.com/slink?231706

Re: svn commit: r329366 - in /lucene/java/trunk: ./ docs/ src/java/org/apache/lucene/document/ src/java/org/apache/lucene/index/ xdocs/

2005-10-29 Thread Yonik Seeley
It's been fairly thoroughly tested by my internal search platform... there is a factory to create all fields, so simply changed the default to omit norms and verified everything still worked. Some public test cases are needed though... I'll work something up. -Yonik Now hiring -- http://forms.cne

Re: svn commit: r329366 - in /lucene/java/trunk: ./ docs/ src/java/org/apache/lucene/document/ src/java/org/apache/lucene/index/ xdocs/

2005-10-29 Thread Yonik Seeley
Definitely. Thanks for pointing that out. BTW, you probably noticed I added docs for the bits for storing term position and offset too, but the format they are stored in still isn't documented in the TermVector section. -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 10/29/05, Daniel

Re: bytecount as String and prefix length

2005-11-01 Thread Yonik Seeley
Thanks for looking into this Marvin... very interesting stuff! I haven't had a chance to review it in detail, but my gut tells me that it should be able to be faster. -Yonik Now hiring -- http://forms.cnet.com/slink?231706

Re: svn commit: r331111 - /lucene/java/trunk/src/test/org/apache/lucene/search/TestBoolean2.java

2005-11-07 Thread Yonik Seeley
I think a non-static would be cleaner in this case, but people would have to be more careful. Isn't it true that the old (1.4) boolean scorer can deliver docs in non-index order? That means you can't use it as a component of anything that requires in-order docs (the new scorer, or Paul's proposed

Re: compatibility of Lucene 1.9

2005-11-08 Thread Yonik Seeley
It's almost impossible to maintain 100% compatibility... I think the current level of compatibility is pretty good. I ran into a couple of minor things myself when upgrading to Lucene 1.9: - writeLogTimeout and commitLockTimeout are now final, so they can't be changed. - I have a class that ex

Re: compatibility of Lucene 1.9

2005-11-09 Thread Yonik Seeley
I think the intention has been to be as backward compatible as possible with 1.9, and that's why there should still be a 1.9 and 2.0 (removing all the deprecated stuff will break tons of things). Patch releases should strive to be a 100% drop in replacement, but that's not a realistic requirement

Re: svn commit: r331964 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/FieldCacheImpl.java

2005-11-09 Thread Yonik Seeley
ser inner classes. > +(Yonik Seeley via Otis Gospodnetic) -Yonik Now hiring -- http://forms.cnet.com/slink?231706

Re: compatibility of Lucene 1.9

2005-11-09 Thread Yonik Seeley
mkdir -p lucene/java/trunk
svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk lucene/java/trunk
cd lucene/java/trunk; ant
The parser generator (JavaCC) is only needed if you change the grammar file. -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/9/05, Bill Janssen <[

Re: Implementation in C & Some Questions

2005-11-11 Thread Yonik Seeley
> - Why the assumption that NO_NORMS for a field implies that the > field is not tokenized with an analyzer: The NO_NORMS object is just for convenience and meant to represent the most likely case when one would omit norms. It's less likely to mess up someone new to lucene since lengthNormalizati

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-14 Thread Yonik Seeley
I'm in favor of that... I think it's fine to eliminate most of the relevancy factors (like idf(), tf(), etc) from range and wildcard type queries. The biggest drawback is that index time boosts are disabled, but I think having queries that can work 100% of the time is more important. I'll work on

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Good point about FuzzyQuery... it has already mostly solved the "too many clauses" thing anyway. I also think the idf should go. There are two different usecases: 1) relevancy: give highest relevance and closest matches, but I don't care if I get 100% of the matches. 2) matching: must give all

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Paul Elschot <[EMAIL PROTECTED]> wrote: > TermQuery relies on field boost and document term frequency, so > having PrefixQuery ignore these would also lead to unexpected > surprises. The surprise from a field boost not working should be found during development. The surprise of queri

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > Paul Elschot wrote: > > I think losing the field boosts for PrefixQuery and friends would not be > > advisable. Field boosts have a very big range and from that a very big > > influence on the score and the order of the results in Hits. > > It

Re: svn commit: r332431 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/FieldDocSortedHitQueue.java src/test/org/apache/lucene/search/TestCustomSearcherSort.java

2005-11-15 Thread Yonik Seeley
It's flagged as ASF in the bug: http://issues.apache.org/jira/browse/LUCENE-456 I'll add the header. -Yonik On 11/15/05, Bernhard Messer <[EMAIL PROTECTED]> wrote: > Yonik, > > TestCustomSearcherSort.java you added a few days ago shows that the > author is Martin Seitz from T-Systems and doesn't

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
I'm not crazy about the idea of scoring changing dramatically. I think people need to be able to specify the scoring style and have it always score that way. Indices change size and composition over time, making it difficult to predict when one would be hit with wildly different scoring (and mo

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Here's a diff to ConstantScoreQuery that optionally folds in norms (minus explain() functionality right now). Should it be added, or do the differences warrant a new Query class, or if kept together, should ConstantScoreQuery be renamed since it's not quite so constant? -Yonik Now hiring -- http:/

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Scoring recap... I think I've seen 4 different types of scoring mentioned in this thread for a term expanding query on a single field:
1) query_boost
2) query_boost * (field_boost * lengthNorm)
3) query_boost * (field_boost * lengthNorm) * tf(t in q)
4) query_boost * (field_boost * lengthNorm) * t

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
On 11/15/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > One option is to keep a score per document, That's what I was thinking...
float[] scores = new float[maxDoc];
scores[doc] += tf(term) * idf(term) * norm(term.field);
It would be nice to keep score compatibility with the current BooleanQuery, then that
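The per-document accumulation described in this message can be sketched in plain Java, without any Lucene classes. This is only an illustration under stated assumptions: the class and method names are hypothetical, postings are passed in as plain arrays, and tf is taken to be the square root of the term frequency, which the message does not specify.

```java
// Hypothetical sketch: one float score per document, summed over every
// term that the expanding query matched. Not Lucene code.
final class ScoreAccumulatorSketch {

    /**
     * docs[t][i] is the i-th document containing term t and freqs[t][i] is
     * that term's frequency in the document (assumed inputs).
     * idf[t] is the idf of term t; norms[doc] is the field norm of doc.
     */
    static float[] accumulate(int maxDoc, int[][] docs, int[][] freqs,
                              float[] idf, float[] norms) {
        float[] scores = new float[maxDoc];
        for (int t = 0; t < docs.length; t++) {
            for (int i = 0; i < docs[t].length; i++) {
                int doc = docs[t][i];
                // Assumption: tf is sqrt(freq); any tf function would do here.
                float tf = (float) Math.sqrt(freqs[t][i]);
                scores[doc] += tf * idf[t] * norms[doc];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 2}, {2}};
        int[][] freqs = {{4, 1}, {9}};
        float[] idf = {2.0f, 1.0f};
        float[] norms = {0.5f, 1.0f, 1.0f};
        float[] scores = accumulate(3, docs, freqs, idf, norms);
        for (int d = 0; d < scores.length; d++) {
            System.out.println("doc " + d + " -> " + scores[d]);
        }
    }
}
```

The array costs four bytes per document regardless of how many terms the query expands to, which is the trade-off the thread goes on to discuss.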

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
> However, one problem I don't know how to solve is > Weight.sumOfSquares(), which needs to know the idf of every single > term, before the scorer is even created! Darn, even if one leaves out idf(), Weight.sumOfSquares() still depends on the number of terms in the query. I guess it's not possibl

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Yonik Seeley
Totally untested, but here is a hack at what the scorer might look like when the number of terms is large. -Yonik

package org.apache.lucene.search;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import java.io.IOExc

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Yonik Seeley
If that's the way to go, we should do it by default so the user doesn't have to. Unless the scores between two types of queries are compatible, it's a bad idea to transparently switch between them since it will cause relevancy to unpredictably change in the future (triggered by either a query chan

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Yonik Seeley
On 11/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > You could instead use a byte[maxDoc] and encode/decode floats as you > store and read them, to use a lot less RAM. Hmmm, very interesting idea. Less than one decimal digit of precision might be hard to swallow when you have to add scores toget
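To make the precision concern concrete, here is a minimal sketch of squeezing a float into one byte by keeping only eight bits of its IEEE 754 representation. The bit layout (bits 21 through 28 after rebasing the exponent, with truncation rather than rounding) is my assumption for illustration; the thread does not say which encoding would be used, and the class name is hypothetical.

```java
// Hypothetical one-byte float encoding: keeps the low exponent bits and the
// top two mantissa bits of a positive float, truncating everything else.
final class ByteFloatSketch {
    private static final int SHIFT = 21;          // low float bits dropped
    private static final int EXP_OFFSET = 48;     // exponent rebasing constant
    private static final int BASE = EXP_OFFSET << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= BASE) {
            // Zero and negatives map to 0; tiny positives to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= BASE + 0x100) {
            return (byte) -1;                     // overflow: largest code
        }
        return (byte) (small - BASE);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << SHIFT;
        bits += EXP_OFFSET << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = {0.3f, 1.0f, 2.0f};
        for (float f : samples) {
            System.out.println(f + " -> " + decode(encode(f)));
        }
    }
}
```

With this layout 1.0f and 2.0f survive the round trip, while 0.3f comes back as 0.25f. Errors of that size compound when many per-term scores are summed into the same slot, which is the worry raised in the reply.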

Float.floatToRawIntBits

2005-11-16 Thread Yonik Seeley
Float.floatToRawIntBits (in Java1.4) gives the raw float bits without normalization (like *(int*)&floatvar would in C). Since it doesn't do normalization of NaN values, it's faster (and hopefully optimized to a simple inline machine instruction by the JVM). On my Pentium4, using floatToRawIntBits
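A small sketch of the difference between the two JDK methods, for readers who have not compared them. The class and helper names are hypothetical; only `Float.floatToIntBits`, `Float.floatToRawIntBits`, and `Float.intBitsToFloat` are real JDK calls. No timing is attempted here, since the speed claim depends on the JVM and hardware.

```java
// Hypothetical demo class: compares the canonicalizing and raw conversions.
final class RawBitsSketch {

    /** True when both conversions return the same bit pattern for f. */
    static boolean sameBits(float f) {
        return Float.floatToIntBits(f) == Float.floatToRawIntBits(f);
    }

    public static void main(String[] args) {
        // For ordinary values the two methods agree.
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(1.0f)));
        System.out.println(sameBits(1.0f));

        // A NaN with a non-canonical payload: floatToIntBits collapses every
        // NaN to 0x7fc00000, while floatToRawIntBits reports the stored bits
        // (which the platform may or may not have preserved).
        float oddNaN = Float.intBitsToFloat(0x7fc00001);
        System.out.println(Integer.toHexString(Float.floatToIntBits(oddNaN)));
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(oddNaN)));
    }
}
```

The raw variant skips the NaN check, so it is only a safe substitute where NaN inputs either cannot occur or do not need a canonical representation.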

Re: Float.floatToRawIntBits

2005-11-16 Thread Yonik Seeley
t). > > http://people.apache.org/~psmith/luceneYourkit.jpg > > Mind you, the whole "signalling via IOException" in the > FastCharStream is a way bigger overhead, although I agree much harder > to fix. > > Paul Smith > > On 17/11/2005, at 7:21 AM, Yonik Seeley wrote:
