Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Plus, as open source and open standard advocates, we don't want to be like Micros$ft, who claims to use industrial "standard" XML as the next generation word file format. However, it is very hard to write your own Word reader, because their word file format is proprietary and hard to write program

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Chuck, Using standard UTF-8 is very important for Lucene index so any program could read the Lucene index easily, be it written in perl, c/c++ or any new future programming languages. It is like storing data in a database for web application. You want to store it in such a way that other pro
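The difference the thread is arguing about can be shown in a few lines of plain Java. This is an illustrative sketch (the class name `Utf8Compare` is mine, not from the thread): `String.getBytes("UTF-8")` produces standard UTF-8, while `DataOutputStream.writeUTF` produces Java's "modified UTF-8", which encodes U+0000 and supplementary characters differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class Utf8Compare {
    // Standard UTF-8, as a non-Java program (perl, c/c++) would expect it.
    public static byte[] standardUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Java's "modified UTF-8", as written by DataOutputStream.writeUTF
    // (the 2-byte length prefix writeUTF prepends is stripped here).
    public static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] withLen = bos.toByteArray();
            byte[] out = new byte[withLen.length - 2];
            System.arraycopy(withLen, 2, out, 0, out.length);
            return out;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        // U+0000: one byte (0x00) in standard UTF-8, but two bytes
        // (0xC0 0x80) in modified UTF-8 -- a strict UTF-8 reader
        // rejects that sequence as over-long.
        System.out.println(standardUtf8("\u0000").length); // 1
        System.out.println(modifiedUtf8("\u0000").length); // 2
        // Supplementary characters: 4 bytes in standard UTF-8, 6 in
        // modified UTF-8 (each UTF-16 surrogate encoded separately).
        String smiley = "\uD83D\uDE00";
        System.out.println(standardUtf8(smiley).length);   // 4
        System.out.println(modifiedUtf8(smiley).length);   // 6
    }
}
```

Those two byte-level divergences are exactly why an index written with modified UTF-8 is not directly readable by non-Java UTF-8 decoders.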

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread Chuck Williams
Could someone summarize succinctly why it is considered a major issue that Lucene uses the Java modified UTF-8 encoding within its index rather than the standard UTF-8 encoding. Is the only concern compatibility with index formats in other Lucene variants? The API to the values is a String, which

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Marvin, Thanks for your quick response. I am in the camp of fearless refactoring, even at the expense of breaking compatibility with previous releases. ;-) Compatibility aside, I am trying to identify if changing the implementation of Term is the right way to go for this problem. If it is,

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread Marvin Humphrey
On May 1, 2006, at 6:27 PM, jian chen wrote: This way, for indexing new documents, the new Term(String text) is called and utf8bytes will be obtained from the input term text. For segment term info merge, the utf8bytes will be loaded from the Lucene index, which already stores the term text

storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, All, Recently I have been following through the whole discussion on storing text/string as standard UTF-8 and how to achieve that in Lucene. If we are storing the term text and the field strings as UTF-8 bytes, I now understand that it is a tricky issue because of the performance problem we

Re: SegmentReader changes?

2006-05-01 Thread Doug Cutting
Robert Engels wrote: BUT, back to the subclassing comments... Why have the runtime replaceable support then in the SegmentReader factory - there is nothing useful a subclass can do at this time, and without API changes, it will never be able to. That's all true. I don't dispute it. There's an

RE: SegmentReader changes?

2006-05-01 Thread Robert Engels
No, not at all. I will put something together. BUT, back to the subclassing comments... Why have the runtime replaceable support then in the SegmentReader factory - there is nothing useful a subclass can do at this time, and without API changes, it will never be able to. -Original Message--

Re: SegmentReader changes?

2006-05-01 Thread Doug Cutting
If the non-public core requires a subclassable SegmentReader then SegmentReader should certainly be made subclassable. But we shouldn't make changes to improve the extensibility of the non-public API. That's a slippery slope. The fact that you can access package-protected members by writing

RE: SegmentReader changes?

2006-05-01 Thread Robert Engels
I can submit a patch to add the IndexReader.reopen() method. BUT, I think the requested change to SegmentReader is still valid, for the reasons cited in the previous email. There is already support for replacing the SegmentReader impl at runtime with System properties, but without the SegmentRead

Re: SegmentReader changes?

2006-05-01 Thread Doug Cutting
Robert Engels wrote: Correct - changing SegmentReader would be best, but in the past, getting proposed patches included has been slower than expected. I'm sorry if the process has been frustrating to you in the past. I hope your experiences are better in the future. So, by making the Segme

RE: SegmentReader changes?

2006-05-01 Thread Robert Engels
Correct - changing SegmentReader would be best, but in the past, getting proposed patches included has been slower than expected. So, by making the SegmentReader more easily subclassed (which should hopefully get approved quicker), I can still have a "build" of Lucene that does not require patching

[jira] Updated: (LUCENE-557) search vs explain - score discrepancies

2006-05-01 Thread Hoss Man (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-557?page=all ] Hoss Man updated LUCENE-557: Attachment: LUCENE-557-newtests.zip Some new tests covering every type of query in the "core" lucene code base. various examples of each query type are checked both t

Re: Returning a minimum number of clusters

2006-05-01 Thread Doug Cutting
Marvin Humphrey wrote: On May 1, 2006, at 10:38 AM, Doug Cutting wrote: Nutch implements host-deduping roughly as follows: To fetch the first 10 hits it first asks for the top-scoring 20 or so. Then it uses a field cache to reduce this to just two from each host. If it runs out of raw hits,
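The host-deduping loop Doug outlines can be sketched over plain data, independent of Nutch's actual classes (all names here are illustrative). Raw hits arrive in descending-score order; we keep at most two per host and stop once we have enough:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostDedupe {
    // rawHosts: the host of each raw hit, in descending-score order.
    // The caller over-fetches (e.g. top 20 raw hits for 10 wanted
    // results) and keeps at most maxPerHost hits from each host.
    public static List<String> dedupe(List<String> rawHosts,
                                      int wanted, int maxPerHost) {
        Map<String, Integer> perHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String host : rawHosts) {
            int seen = perHost.getOrDefault(host, 0);
            if (seen < maxPerHost) {
                perHost.put(host, seen + 1);
                kept.add(host);
                if (kept.size() == wanted) break; // enough results
            }
        }
        // If kept.size() < wanted here, the raw hits ran out; per
        // Doug's description the query is re-run, filtering out the
        // hosts that already hit the cap.
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "a", "a", "b", "a", "c");
        System.out.println(dedupe(raw, 4, 2)); // [a, a, b, c]
    }
}
```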

Re: SegmentReader changes?

2006-05-01 Thread Doug Cutting
Robert Engels wrote: In implementing the 'reopen()' method SegmentReader needs to be subclassed in order to support 'refreshing' the deleted documents. Why subclass? Why not simply change SegmentReader? It's not a public class at present, and making it a public class would be a bigger change

RE: refresh segments for deleted documents?

2006-05-01 Thread Robert Engels
Thanks. I understand now. In my usage pattern deletions are never out of sync - that is why it works. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, May 01, 2006 5:36 PM To: java-dev@lucene.apache.org Subject: Re: refresh segments for deleted documents? R

Re: refresh segments for deleted documents?

2006-05-01 Thread Doug Cutting
Robert Engels wrote: Doug, can you comment on exactly why the 'deletions' need to be re-read? Doesn't seem necessary to me. A common idiom is to use one IndexReader for searches, and a separate for deletions. For example, one might do something like: 1. Open IndexReader A. 2. Start serving
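The staleness Doug's numbered steps describe can be simulated in plain Java (this is not Lucene code; the `BitSet` stands in for a segment's on-disk `.del` file): reader A searches from a snapshot of the deletions taken at open time, reader B deletes a document, and A does not see the deletion until it re-reads the deletions.

```java
import java.util.BitSet;

public class DeletionStaleness {
    // Stands in for a segment's .del file on disk.
    private final BitSet onDisk = new BitSet();

    class Reader {
        private BitSet deletions = (BitSet) onDisk.clone(); // open-time snapshot
        boolean isDeleted(int doc) { return deletions.get(doc); }
        void delete(int doc) { deletions.set(doc); onDisk.set(doc); }
        void reopen() { deletions = (BitSet) onDisk.clone(); }
    }

    // Runs Doug's steps; returns {A sees the delete before reopen,
    // A sees the delete after reopen}.
    public static boolean[] simulate() {
        DeletionStaleness index = new DeletionStaleness();
        Reader a = index.new Reader();  // 1. open reader A for searching
        Reader b = index.new Reader();  // 3. open reader B for deleting
        b.delete(7);                    // 4. delete a document via B
        boolean before = a.isDeleted(7); // A's snapshot is stale
        a.reopen();                      // re-read deletions from "disk"
        boolean after = a.isDeleted(7);
        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println(r[0] + " " + r[1]); // false true
    }
}
```

This is why a reopen() that skips re-reading deletions only works in usage patterns like Robert's, where a single reader performs both the searches and the deletes.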

[jira] Updated: (LUCENE-557) search vs explain - score discrepancies

2006-05-01 Thread Hoss Man (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-557?page=all ] Hoss Man updated LUCENE-557: Attachment: LUCENE-557-modify-existing-tests.patch Update to previous patch, with some additional helper utilities in CheckHits > search vs explain - score discrepanci

RE: GData, updateable IndexSearcher

2006-05-01 Thread Robert Engels
I just sent an email covering that. The code I provided takes that into account, but in re-reading the code, I do not think it is necessary. -Original Message- From: jason rutherglen [mailto:[EMAIL PROTECTED] Sent: Monday, May 01, 2006 5:17 PM To: java-dev@lucene.apache.org Subject: Re: G

Re: GData, updateable IndexSearcher

2006-05-01 Thread jason rutherglen
Thanks for the code and performance metric Robert. Have you had any issues with the deleted segments as Doug has been describing? - Original Message From: Robert Engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org; jason rutherglen <[EMAIL PROTECTED]> Sent: Monday, May 1, 2006 11:

[jira] Updated: (LUCENE-561) ParallelReader fails on deletes and on seeks of previously unused fields

2006-05-01 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-561?page=all ] Chuck Williams updated LUCENE-561: -- Attachment: ParallelReaderBugs.patch > ParallelReader fails on deletes and on seeks of previously unused fields > --

[jira] Created: (LUCENE-561) ParallelReader fails on deletes and on seeks of previously unused fields

2006-05-01 Thread Chuck Williams (JIRA)
ParallelReader fails on deletes and on seeks of previously unused fields Key: LUCENE-561 URL: http://issues.apache.org/jira/browse/LUCENE-561 Project: Lucene - Java Type: Bug Components: Inde

Re: this == that

2006-05-01 Thread Chris Hostetter
A couple of responses to various comments in this thread... : > Unless object identity is what is being tested or intern is an : > invariant, I think it is dangerous. It is easy to forget to intern or to : > propagate the pattern via cut and paste to an inappropriate context. interning the St

refresh segments for deleted documents?

2006-05-01 Thread Robert Engels
I implemented the IndexReader.reopen(). My original implementation did not "refresh" the deleted documents, and it seemed to work. The latest impl does re-read the deletions. BUT, on inspecting the IndexReader code, I am not sure this is necessary??? When a document is deleted, IndexReader marks

Re: this == that

2006-05-01 Thread Yonik Seeley
On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote: I am wondering if interning Strings will be really that critical for performance. Probably not as much as it was for early JVMs. The biggest bottleneck is still disk. Depends on the index and workload. Queries are often CPU-bound. -Yonik ht

Re: this == that

2006-05-01 Thread jian chen
I am wondering if interning Strings will be really that critical for performance. The biggest bottleneck is still disk. So, maybe we can use String.equals(...) instead of ==. Jian On 5/1/06, DM Smith <[EMAIL PROTECTED]> wrote: karl wettin wrote: > The code is filled with string equality code
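The three claims in this thread are easy to check with a self-contained snippet: `==` fails on equal-but-distinct strings, `intern()` restores the identity invariant that makes `==` safe, and `equals()` itself begins with a reference check.

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two equal strings built at runtime are distinct objects,
        // so == alone is wrong unless both sides are interned.
        String a = new String("title");
        String b = new String("title");
        System.out.println(a == b);        // false: different objects
        System.out.println(a.equals(b));   // true: same characters

        // intern() returns one canonical object per distinct value,
        // restoring the invariant that makes == comparisons valid.
        System.out.println(a.intern() == b.intern()); // true

        // equals() short-circuits on (this == that), so when the
        // operands are already the same object it costs a single
        // reference comparison -- karl's point about the JIT.
        String c = a;
        System.out.println(a.equals(c));   // true, via the identity check
    }
}
```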

Re: Returning a minimum number of clusters

2006-05-01 Thread Marvin Humphrey
On May 1, 2006, at 10:21 AM, Grant Ingersoll wrote: You might be interested in the Carrot project, which has some Lucene support. I don't know if it solves your second problem, but it already implements clustering and may allow you to get to an answer for the second problem quicker. I ha

Re: Returning a minimum number of clusters

2006-05-01 Thread Marvin Humphrey
On May 1, 2006, at 10:38 AM, Doug Cutting wrote: Nutch implements host-deduping roughly as follows: To fetch the first 10 hits it first asks for the top-scoring 20 or so. Then it uses a field cache to reduce this to just two from each host. If it runs out of raw hits, then it re-runs the q

RE: GData, updateable IndexSearcher

2006-05-01 Thread Robert Engels
Attached. It uses subclasses and instanceof which is sort of "hackish" - to do it correctly requires changes to the base classes. -Original Message- From: jason rutherglen [mailto:[EMAIL PROTECTED] Sent: Monday, May 01, 2006 1:43 PM To: java-dev@lucene.apache.org Subject: Re: GData, upd

Re: GData, updateable IndexSearcher

2006-05-01 Thread jason rutherglen
Can you post your code? - Original Message From: Robert Engels <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org; jason rutherglen <[EMAIL PROTECTED]> Sent: Monday, May 1, 2006 11:33:06 AM Subject: RE: GData, updateable IndexSearcher fyi, using my reopen() implementation (which rereads

[jira] Updated: (LUCENE-560) NPE in SpanNear when used as exclusion for SpanNot

2006-05-01 Thread Hoss Man (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-560?page=all ] Hoss Man updated LUCENE-560: Attachment: LUCENE-560-test.patch revised patch showing same bug when exclude is a SpanFirst containing a SpanNear. > NPE in SpanNear when used as exclusion for SpanN

Lucene 2.0 similarity scoring formula

2006-05-01 Thread Charlie
Hello, Will 2.0's similarity scoring formula remain the same as the following? \sum_{t in q} tf(t in d) * idf(t) * boost(t.field in d) * lengthNorm(t.field in d) or what exactly will it be? -- Best regards, Charlie - To un
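For readability, the formula Charlie quotes can be set as LaTeX (note that the full 1.x Similarity documentation also multiplies in coordination and query-normalization factors, which this abbreviated form omits):

```latex
\mathrm{score}(q,d) = \sum_{t \in q}
  \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t) \cdot
  \mathrm{boost}(t.\mathrm{field} \in d) \cdot
  \mathrm{lengthNorm}(t.\mathrm{field} \in d)
```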

RE: GData, updateable IndexSearcher

2006-05-01 Thread Robert Engels
fyi, using my reopen() implementation (which rereads the deletions) on a 135mb index, with 5000 iterations open & close time using new reader = 585609 open & close time using reopen = 27422 Almost 20x faster. Important in a highly interactive/incremental updating index. -Original Message---

Re: GData, updateable IndexSearcher

2006-05-01 Thread jason rutherglen
I wanted to post a quick hack to see if it is along the correct lines. A few of the questions regard whether to reuse existing MultiReaders or simply strip out only the SegmentReaders. I do a compare on the segment name and made it public. Thanks! public static IndexReader reopen(IndexRead

Re: A problem running two or more negative Span clauses

2006-05-01 Thread Chris Hostetter
: I am having problems running span queries with more than one : negative clauses: i believe you mean when the exclude clause contains a SpanNear query correct? : Is the span query nested correctly? I'm not very good at reading SpanQuery.toString() output ... but i believe i encountered the s

[jira] Updated: (LUCENE-560) NPE in SpanNear when used as exclusion for SpanNot

2006-05-01 Thread Hoss Man (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-560?page=all ] Hoss Man updated LUCENE-560: Attachment: LUCENE-560-test.patch patch to TestBasics.java demonstrating bug > NPE in SpanNear when used as exclusion for SpanNot > ---

[jira] Created: (LUCENE-560) NPE in SpanNear when used as exclusion for SpanNot

2006-05-01 Thread Hoss Man (JIRA)
NPE in SpanNear when used as exclusion for SpanNot -- Key: LUCENE-560 URL: http://issues.apache.org/jira/browse/LUCENE-560 Project: Lucene - Java Type: Bug Components: Search Reporter: Hoss Man Attachments: LU

Re: this == that

2006-05-01 Thread DM Smith
karl wettin wrote: The code is filled with string equality code using == rather than equals(). I honestly don't think it saves a single clock tick as the JIT takes care of it when the first line of code in the equals method is if (this == that) return true; If the strings are intern() then it

[jira] Commented: (LUCENE-525) A standard Lucene install that works for simple web sites

2006-05-01 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-525?page=comments#action_12377252 ] Otis Gospodnetic commented on LUCENE-525: - Of course. > A standard Lucene install that works for simple web sites > ---

Re: Returning a minimum number of clusters

2006-05-01 Thread Doug Cutting
Marvin Humphrey wrote: The problem I'm trying to solve is how to return a minimum number of clusters from a search. Say the most relevant 100 documents for a query are all from the same domain, but you want a maximum of two results per domain, a la Google. I don't see any alternative to

Re: Returning a minimum number of clusters

2006-05-01 Thread Grant Ingersoll
You might be interested in the Carrot project, which has some Lucene support. I don't know if it solves your second problem, but it already implements clustering and may allow you to get to an answer for the second problem quicker. I have, just recently, started using it for a clustering task

[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

2006-05-01 Thread robert engels (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377243 ] robert engels commented on LUCENE-436: -- The finalize() method should be removed from the Segment/Term readers. In most cases this will actually make things worse, as the

[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

2006-05-01 Thread robert engels (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377242 ] robert engels commented on LUCENE-436: -- The bug is invalid, as was pointed out on the lucene-dev list. When the thread dies, all thread locals will be cleaned-up. If the

[jira] Commented: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-05-01 Thread Nicholaus Shupe (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-529?page=comments#action_12377239 ] Nicholaus Shupe commented on LUCENE-529: This bug seems to be a dupe of http://issues.apache.org/jira/browse/LUCENE-436, invalid or not. > TermInfosReader and other +

[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

2006-05-01 Thread Nicholaus Shupe (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377238 ] Nicholaus Shupe commented on LUCENE-436: I will be trying this patch on Lucene 1.9.1 to see if it fixes my production memory leak with Tomcat 5.5.15 / Redhat / Java 1.

Returning a minimum number of clusters

2006-05-01 Thread Marvin Humphrey
Greets, I'm toying with the idea of implementing clustering of search results based on comparison of document vectors constrained by field. For instance, you could cluster based on "topic", or "domain", or "content". "domain" would be easy, as it's presumably a single value field. "con

[jira] Commented: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-05-01 Thread robert engels (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-529?page=comments#action_12377228 ] robert engels commented on LUCENE-529: -- As was pointed out in the email threads related to this, the submitters test cases are incorrect. This bug should be closed as in

[jira] Commented: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-05-01 Thread Nicholaus Shupe (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-529?page=comments#action_12377225 ] Nicholaus Shupe commented on LUCENE-529: I'm experiencing a memory leak in Lucene 1.9.1 that might be related to this problem. Looks like I'll be creating my own patc

Re: this == that

2006-05-01 Thread karl wettin
30 apr 2006 kl. 04.48 skrev Tatu Saloranta: JIT takes care of it when the first line of code in the equals method is if (this == that) return true; In case where (this == that) is true, this may well be correct, but: Please correct me if I'm wrong. ... you are then assuming 100% match ra