> I haven't read the details, but should maxBufferedDocs be exposed in
> some subinterfaces instead of the MergePolicy interface?
I've been wondering about this, too, but haven't come to any strong
opinions (yet). I figured maybe playing with a few merge policies might
make things clearer.
maxBuf
Hi Steven,
I haven't read the details, but should maxBufferedDocs be exposed in
some subinterfaces instead of the MergePolicy interface? Since some
policies may use it and others may use byte size or something else.
It's great that you've started on concurrent merge as well! I haven't
got a chan
And what about Project Gutenberg?
Wikipedia is going to have relatively short text, Gutenberg very long.
-Original Message-
From: Steven Parkes [mailto:[EMAIL PROTECTED]
Sent: Friday, March 23, 2007 2:37 PM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Commented: (LUCENE-845) If you
Well, since I want to look at the impact of merge policy, I'll look into
this.
Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd of
the current English pages is 2.1G. That's certainly a lot of data. It
looks like the English is about 1.8M docs. All languages is something
like
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steven Parkes updated LUCENE-846:
-
Attachment: LUCENE-846-test.txt
Okay, here's an old version of 847 that demonstrates the problem
[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483742
]
Steven Parkes commented on LUCENE-847:
--
Visibility is one of those things I haven't cleaned up yet.
Client code
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483738
]
Steven Parkes commented on LUCENE-846:
--
Dang. It used to fail but I've made some updates since then that would c
[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483737
]
Doug Cutting commented on LUCENE-847:
-
How public should such an API be? Should the interface be public? Should
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483734
]
Michael McCandless commented on LUCENE-846:
---
OK, thanks.
Hurm, I found the _, removed it, but the test sti
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483732
]
Steven Parkes commented on LUCENE-846:
--
Ahh ... meant to mention that. I disabled
TestIndexWriter#testAddIndexO
[
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483731
]
Paul Elschot commented on LUCENE-584:
-
Otis:
> However, I need Sort and TopFieldDocs, and I don't see a match me
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483729
]
Michael McCandless commented on LUCENE-846:
---
Steve, I applied the patch from LUCENE-847 and all tests ran s
Yeah, I didn't play yet with millions of documents. We will need a
bigger test collection, I think! Although the benchmarker can add as
many as you want from the same source, index compression will affect
the results possibly more than a bigger collection with all unique docs.
Maybe it is
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> I've only been loosely following this...
>
> Do you think it is possible to separate the stored/term vector
> handling into a separate patch against the current trunk? This seems
> like a quick win and I know it has been speculated about before.
Michael McCandless wrote:
Also, one caveat: whenever #docs (21578 for Reuters) divided by
maxBufferedDocs is less than mergeFactor, you will have no merges
take place during your runs. This greatly skews the results.
Also, my guess is that this index fits entirely in the buffer cache.
Things
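The caveat above can be checked with a little arithmetic. The following is a minimal sketch, not Lucene code: the class and method names are hypothetical, and it assumes every segment is flushed at exactly maxBufferedDocs documents and that a merge needs mergeFactor flushed segments.

```java
// Hypothetical helper, not part of Lucene: applies the rule of thumb that
// no merge can run when numDocs / maxBufferedDocs < mergeFactor.
final class MergeCheck {
    /** Returns true if enough full segments are flushed to trigger at least one merge. */
    static boolean mergesOccur(int numDocs, int maxBufferedDocs, int mergeFactor) {
        if (numDocs < 0 || maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need numDocs >= 0, maxBufferedDocs >= 1, mergeFactor >= 2");
        }
        // Integer division counts only the completely filled (flushed) segments.
        int flushedSegments = numDocs / maxBufferedDocs;
        return flushedSegments >= mergeFactor;
    }
}
```

For the 21578 Reuters documents with mergeFactor 10, maxBufferedDocs of 1000 gives 21 flushed segments, so merges do run, while maxBufferedDocs of 10000 gives only 2, so none do.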
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483725
]
Michael McCandless commented on LUCENE-846:
---
Hmmm, this does not sound good! I'll apply the patch and trac
[
https://issues.apache.org/jira/browse/LUCENE-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless reassigned LUCENE-846:
-
Assignee: Michael McCandless
> IOExeception can cause loss of data due to prematu
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> Is it still safe to say that NFS should be avoided, if possible, for
> Lucene use? I know Mike M. has been doing a lot of work to make the
> support better, I'm just wondering what the recommendation is.
As far as I know, if you make your own de
[
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483724
]
Otis Gospodnetic commented on LUCENE-584:
-
Paul:
Applied the patch (it applied cleanly), ran ant test -> BUILD SU
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> Your timing is ironic. I was just running some benchmarks for
> ApacheCon (using contrib/benchmarker) and noticed what I think are
> similar happenings, so maybe you can validate my assumptions. I'm
> not sure if it is because I'm hitting RAM
Is it still safe to say that NFS should be avoided, if possible, for
Lucene use? I know Mike M. has been doing a lot of work to make the
support better, I'm just wondering what the recommendation is.
-Grant
[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steven Parkes updated LUCENE-847:
-
Attachment: LUCENE-847.txt
Here's a first cut at a factored merge policy.
It's not polished. Spa
I've only been loosely following this...
Do you think it is possible to separate the stored/term vector
handling into a separate patch against the current trunk? This seems
like a quick win and I know it has been speculated about before.
On Mar 23, 2007, at 12:00 PM, Michael McCandless wro
Factor merge policy out of IndexWriter
--
Key: LUCENE-847
URL: https://issues.apache.org/jira/browse/LUCENE-847
Project: Lucene - Java
Issue Type: Improvement
Reporter: Steven Parkes
A
IOExeception can cause loss of data due to premature segment deletion
-
Key: LUCENE-846
URL: https://issues.apache.org/jira/browse/LUCENE-846
Project: Lucene - Java
Issue Ty
Hi Mike,
Your timing is ironic. I was just running some benchmarks for
ApacheCon (using contrib/benchmarker) and noticed what I think are
similar happenings, so maybe you can validate my assumptions. I'm
not sure if it is because I'm hitting RAM issues or not.
Below is the algorithm fil
[
https://issues.apache.org/jira/browse/LUCENE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483682
]
Doron Cohen commented on LUCENE-837:
I updated the docs to be clearer about the records counting, and also list th
Hi, Mark,
Thanks for providing this original approach for synonyms. I read through
your code and think maybe this could be extended to handle the word stemming
problem as well.
Here is my thought.
1) Before indexing, create a Map> stemmedWordMap,
the key is the stemmed word.
2) At indexing, we
Although I will not attend the actual convention (unless I figure out
a way to pay for just that one seminar about Hadoop I'd really like
to see) I'll be in Amsterdam during ApacheCon, in case anyone wants
to meet up.
--
karl
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor. But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.
That is not how the current merge p
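The level scheme Michael describes (not the existing merge policy, as the reply notes) can be simulated in a few lines. This is only a sketch under stated assumptions: the class and method names are made up for illustration, every flush creates one level-0 segment, and mergeFactor segments at one level are merged into a single segment at the next level as soon as they exist.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the described level scheme; not Lucene code.
final class LevelSim {
    /** Returns the number of live segments at each level (index = level) after the given flushes. */
    static List<Integer> segmentsPerLevel(int flushes, int mergeFactor) {
        if (flushes < 0 || mergeFactor < 2) {
            throw new IllegalArgumentException("need flushes >= 0 and mergeFactor >= 2");
        }
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            int level = 0; // a freshly flushed segment starts at level 0
            while (true) {
                if (counts.size() <= level) {
                    counts.add(0);
                }
                int n = counts.get(level) + 1;
                if (n < mergeFactor) {
                    counts.set(level, n);
                    break;
                }
                // mergeFactor segments at this level: merge them into one segment one level up.
                counts.set(level, 0);
                level++;
            }
        }
        return counts;
    }
}
```

Under these assumptions the counts are simply the digits of the flush count written in base mergeFactor, least significant digit first.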
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Merging is costly because you read all data in then write all data
> > out, so you want to minimize, for each byte of data in the index,
> > how many times it will be "serviced" (read i
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Merging is costly because you read all data in then write all data
out, so you want to minimize, for each byte of data in the index,
how many times it will be "serviced" (read in, written out) as
part of a merge.
Avoiding the re-w
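To put a rough number on how often a byte gets "serviced", here is a minimal sketch. It is not Lucene code and the names are hypothetical; it assumes equal-sized flushed segments and a scheme in which mergeFactor segments of one level are merged into one segment of the next level, and it counts only rewrites done by merges, not the initial flush.

```java
// Hypothetical estimate of merge cost under a simple level-based scheme; not Lucene code.
final class MergeCost {
    /** Returns the largest number of times any byte is rewritten by merges after the given flushes. */
    static int maxRewrites(int flushedSegments, int mergeFactor) {
        if (flushedSegments < 0 || mergeFactor < 2) {
            throw new IllegalArgumentException("need flushedSegments >= 0 and mergeFactor >= 2");
        }
        int rewrites = 0;
        // span = number of flushed segments covered by one segment at the next level;
        // long avoids overflow when multiplying.
        long span = mergeFactor;
        while (span <= flushedSegments) {
            rewrites++;
            span *= mergeFactor;
        }
        return rewrites;
    }
}
```

With mergeFactor 10, 1000 flushed segments mean the oldest data has been rewritten three times, which is why the count grows only logarithmically with index size.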
[
https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483631
]
Michael McCandless commented on LUCENE-845:
---
This bug is actually rather serious.
If you set maxBufferedDo
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > We say that
> > developers should not rely on docIDs but people still seem to rely on
> > their monotonic ordering (even though they change).
>
> Yes. If the benefits of removing that guarantee are large enough, we
> could consider dumping it... but
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).
Yes. If the benefits of removing that guarantee are large enough, we
could consider dumping it... but
Given history, perhaps the default merge policy should conserve this,
but with pluggable merge policies, I don't see that all merge policies
need to.
-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED]
Sent: Friday, March 23, 2007 1:53 AM
To: java-dev@lucene.apache.org
S
On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote:
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond
to age, I think. ISTR Hatcher at least recommending techniques
that require it.
[
https://issues.apache.org/jira/browse/LUCENE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483586
]
Grant Ingersoll commented on LUCENE-837:
Thanks. Making the changes now. I moved the bytes storage up to
B
"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think. ISTR Hatcher at least recommending techniques that
> : require it.
>
> "Correspond to age" may be misleading as it