RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Steven Parkes
> I haven't read the details, but should maxBufferedDocs be exposed in > some subinterfaces instead of the MergePolicy interface? I've been wondering about this, too, but haven't come to any strong opinions (yet). I figured maybe playing with a few merge policies might make things clearer. maxBuf

Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Ning Li
Hi Steven, I haven't read the details, but should maxBufferedDocs be exposed in some subinterfaces instead of the MergePolicy interface? Since some policies may use it and others may use byte size or something else. It's great that you've started on concurrent merge as well! I haven't got a chan

RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Steven Parkes
And what about Project Gutenberg? Wikipedia is going to have relatively short text, Gutenberg very long. -----Original Message----- From: Steven Parkes [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 2:37 PM To: java-dev@lucene.apache.org Subject: RE: [jira] Commented: (LUCENE-845) If you

RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Steven Parkes
Well, since I want to look at the impact of merge policy, I'll look into this. Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd of the current English pages is 2.1G. That's certainly a lot of data. It looks like the English is about 1.8M docs. All languages is something like

[jira] Updated: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Steven Parkes (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Parkes updated LUCENE-846: - Attachment: LUCENE-846-test.txt Okay, here's an old version of 847 that demonstrates the problem

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Steven Parkes (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483742 ] Steven Parkes commented on LUCENE-847: -- Visibility is one of those things I haven't cleaned up yet. Client code

[jira] Commented: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Steven Parkes (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483738 ] Steven Parkes commented on LUCENE-846: -- Dang. It used to fail but I've made some updates since then that would c

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483737 ] Doug Cutting commented on LUCENE-847: - How public should such an API be? Should the interface be public? Should

[jira] Commented: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483734 ] Michael McCandless commented on LUCENE-846: --- OK, thanks. Hurm, I found the _, removed it, but the test sti

[jira] Commented: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Steven Parkes (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483732 ] Steven Parkes commented on LUCENE-846: -- Ahh ... meant to mention that. I disabled TestIndexWriter#testAddIndexO

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-03-23 Thread Paul Elschot (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483731 ] Paul Elschot commented on LUCENE-584: - Otis: > However, I need Sort and TopFieldDocs, and I don't see a match me

[jira] Commented: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483729 ] Michael McCandless commented on LUCENE-846: --- Steve, I applied the patch from LUCENE-847 and all tests ran s

Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Grant Ingersoll
Yeah, I didn't play yet with millions of documents. We will need a bigger test collection, I think! Although the benchmarker can add as many as you want from the same source, index compression will affect the results possibly more than a bigger collection with all unique docs. Maybe it is

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote: > I've only been loosely following this... > > Do you think it is possible to separate the stored/term vector > handling into a separate patch against the current trunk? This seems > like a quick win and I know it has been speculated about before.

Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Doug Cutting
Michael McCandless wrote: Also, one caveat: whenever #docs (21578 for Reuters) divided by maxBufferedDocs is less than mergeFactor, you will have no merges take place during your runs. This greatly skews the results. Also, my guess is that this index fits entirely in the buffer cache. Things
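The caveat quoted above is simple arithmetic, and a small check makes it concrete. The following is a minimal sketch, not Lucene code: the class and method names are hypothetical, and it assumes a segment is flushed every maxBufferedDocs documents and that the first merge needs mergeFactor flushed segments.

```java
// Hypothetical helper, not part of Lucene: estimates whether a benchmark run
// is large enough to trigger any merge under the caveat quoted above.
final class MergeSkewCheck {

    // Assumption: one segment is flushed per maxBufferedDocs documents, and a
    // merge first happens once mergeFactor flushed segments have accumulated.
    static boolean mergesExpected(int numDocs, int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs <= 0 || mergeFactor <= 0) {
            throw new IllegalArgumentException("maxBufferedDocs and mergeFactor must be positive");
        }
        int fullFlushes = numDocs / maxBufferedDocs; // integer division: complete flushes only
        return fullFlushes >= mergeFactor;
    }

    public static void main(String[] args) {
        int reutersDocs = 21578; // collection size quoted in the message
        System.out.println(mergesExpected(reutersDocs, 10000, 10));
        System.out.println(mergesExpected(reutersDocs, 1000, 10));
    }
}
```

With the 21578-document Reuters collection and a mergeFactor of 10, a maxBufferedDocs of 10000 yields only two full flushes, so a run under these assumptions would measure no merging at all, while a maxBufferedDocs of 1000 yields 21 flushes and would.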

[jira] Commented: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483725 ] Michael McCandless commented on LUCENE-846: --- Hmmm, this does not sound good! I'll apply the patch and trac

[jira] Assigned: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-846: - Assignee: Michael McCandless > IOExeception can cause loss of data due to prematu

Re: NFS Question

2007-03-23 Thread Michael McCandless
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote: > Is it still safe to say that NFS should be avoided, if possible, for > Lucene use? I know Mike M. has been doing a lot of work to make the > support better, I'm just wondering what the recommendation is. As far as I know, if you make your own de

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-03-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483724 ] Otis Gospodnetic commented on LUCENE-584: - Paul: Applied the patch, applied cleanly, run ant test -> BUILD SU

Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Michael McCandless
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote: > Your timing is ironic. I was just running some benchmarks for > ApacheCon (using contrib/benchmarker) and noticed what I think are > similar happenings, so maybe you can validate my assumptions. I'm > not sure if it is because I'm hitting RAM

NFS Question

2007-03-23 Thread Grant Ingersoll
Is it still safe to say that NFS should be avoided, if possible, for Lucene use? I know Mike M. has been doing a lot of work to make the support better, I'm just wondering what the recommendation is. -Grant

[jira] Updated: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Steven Parkes (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Parkes updated LUCENE-847: - Attachment: LUCENE-847.txt Here's a first cut at a factored merge policy. It's not polished. Spa

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Grant Ingersoll
I've only been loosely following this... Do you think it is possible to separate the stored/term vector handling into a separate patch against the current trunk? This seems like a quick win and I know it has been speculated about before. On Mar 23, 2007, at 12:00 PM, Michael McCandless wro

[jira] Created: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Steven Parkes (JIRA)
Factor merge policy out of IndexWriter -- Key: LUCENE-847 URL: https://issues.apache.org/jira/browse/LUCENE-847 Project: Lucene - Java Issue Type: Improvement Reporter: Steven Parkes A

[jira] Created: (LUCENE-846) IOExeception can cause loss of data due to premature segment deletion

2007-03-23 Thread Steven Parkes (JIRA)
IOExeception can cause loss of data due to premature segment deletion - Key: LUCENE-846 URL: https://issues.apache.org/jira/browse/LUCENE-846 Project: Lucene - Java Issue Ty

Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Grant Ingersoll
Hi Mike, Your timing is ironic. I was just running some benchmarks for ApacheCon (using contrib/benchmarker) and noticed what I think are similar happenings, so maybe you can validate my assumptions. I'm not sure if it is because I'm hitting RAM issues or not. Below is the algorithm fil

[jira] Commented: (LUCENE-837) contrib/benchmark QueryMaker and Task Refactorings

2007-03-23 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483682 ] Doron Cohen commented on LUCENE-837: I updated the docs to be clearer about the records counting, and also list th

(LUCENE-835) An IndexReader with run-time support for synonyms

2007-03-23 Thread jian chen
Hi, Mark, Thanks for providing this original approach for synonyms. I read through your code and think maybe this could be extended to handle the word stemming problem as well. Here is my thought. 1) Before indexing, create a Map> stemmedWordMap, the key is the stemmed word. 2) At indexing, we

ApacheCon

2007-03-23 Thread karl wettin
Although I will not attend the actual convention (unless I figure out a way to pay for just that one seminar about Hadoop I'd really like to see) I'll be in Amsterdam during ApacheCon, in case anyone wants to meet up. -- karl

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: Yes the code re-computes the level of a given segment from the current values of maxBufferedDocs & mergeFactor. But when these values have changed (or, segments were flushed by RAM not by maxBufferedDocs) then the way it computes level no

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: Right I'm calling a newly created segment (ie flushed from RAM) level 0 and then a level 1 segment is created when you merge 10 level 0 segments, level 2 is created when merge 10 level 1 segments, etc. That is not how the current merge p
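The level scheme described in the quoted text (a flushed segment is level 0, ten level 0 segments merge into one level 1 segment, and so on) can be simulated in a few lines. This is only a sketch of that quoted model, with hypothetical names; the reply notes that the current merge policy does not work this way, so it should not be read as IndexWriter's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the level model quoted above; not Lucene code.
final class LevelCascadeSketch {

    // Returns the number of segments left at each level (index 0 = level 0)
    // after `flushes` segments have been flushed, assuming mergeFactor
    // segments at one level are merged into a single segment at the next.
    static List<Integer> segmentsPerLevel(int flushes, int mergeFactor) {
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            while (true) {
                if (counts.size() <= level) {
                    counts.add(0);
                }
                counts.set(level, counts.get(level) + 1);
                if (counts.get(level) < mergeFactor) {
                    break; // no merge triggered at this level
                }
                counts.set(level, 0); // these segments are merged away...
                level++;              // ...into one segment at the next level
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(segmentsPerLevel(123, 10));
    }
}
```

Under this model the per-level counts are the digits of the flush count written in base mergeFactor, least significant first, so 123 flushes with a mergeFactor of 10 leave three level 0 segments, two level 1 segments, and one level 2 segment.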

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Merging is costly because you read all data in then write all data > > out, so, you want to minimize, for each byte of data in the > > index, how many times it will be "serviced" (read i

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Yonik Seeley
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: Merging is costly because you read all data in then write all data out, so, you want to minimize, for each byte of data in the index, how many times it will be "serviced" (read in, written out) as part of a merge. Avoiding the re-w

[jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-23 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483631 ] Michael McCandless commented on LUCENE-845: --- This bug is actually rather serious. If you set maxBufferedDo

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > We say that > > developers should not rely on docIDs but people still seem to rely on > > their monotonic ordering (even though they change). > > Yes. If the benefits of removing that guarantee are large enough, we > could consider dumping it... but

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Yonik Seeley
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: We say that developers should not rely on docIDs but people still seem to rely on their monotonic ordering (even though they change). Yes. If the benefits of removing that guarantee are large enough, we could consider dumping it... but

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Steven Parkes
Given history, perhaps the default merge policy should conserve this, but with pluggable merge policies, I don't see that all merge policies need to. -----Original Message----- From: Michael McCandless [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 1:53 AM To: java-dev@lucene.apache.org S

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Erik Hatcher
On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote: On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote: Actually is #2 a hard requirement? A lot of Lucene users depend on having document number correspond to age, I think. ISTR Hatcher at least recommending techniques that require it.

[jira] Commented: (LUCENE-837) contrib/benchmark QueryMaker and Task Refactorings

2007-03-23 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483586 ] Grant Ingersoll commented on LUCENE-837: Thanks. Making the changes now. I moved the bytes storage up to B

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
"Chris Hostetter" <[EMAIL PROTECTED]> wrote: > : > Actually is #2 a hard requirement? > : > : A lot of Lucene users depend on having document number correspond to > : age, I think. ISTR Hatcher at least recommending techniques that > : require it. > > "Correspond to age" may be misleading as it