Re: lucene and solr trunk

2010-03-17 Thread Ian Holsman
what other libraries do is have a 'core' or a 'common' bit.. which is 
what the lucene library really is.


looking at http://svn.apache.org/repos/asf/lucene/ today I see nearly 
that, but it's called 'java'.
maybe just renaming 'java' to 'core' or 'common' (hadoop uses common) 
might make sense

and let ivy or maven be responsible for pulling the other parts.

as a weekend developer, I would just pull the bit I care about, and let 
ivy or maven get the other bits for me.


btw.. having a master 'pom.xml' in 
http://svn.apache.org/repos/asf/lucene/ could just include the 
module poms and build them

without having to have nightly jars etc.

as for the goal of doing single commits, I've noticed that most of the 
discussion has been in the format of


/lucene/XYZ/trunk/...
and /lucene/ABC/trunk

if this is one code base, would it make sense to have it:
/lucene/trunk/ABC
/lucene/trunk/XYZ

?
On 3/18/10 11:33 AM, Chris Hostetter wrote:

: build and nicely gets all dependencies to Lucene and Tika whenever I build
: or release, no problem there and certainly no need to have it merged into
: Lucene's svn!

The key distinction is that Solr is already in "Lucene's svn" -- The
question is how to reorg things in a way that makes it easier to build Solr
and Lucene-Java all at once, while still making it easy to build just
Lucene-Java.

: Professionally i work on a (world-class) geocoder that also nicely depends
: on Lucene by using maven, no problems there at all and no need to merge
: that code in Lucene's svn!

Unless maven has some features i'm not aware of, your "nicely depends"
works by pulling Lucene jars from a repository -- changing Solr to do
that (instead of having committed jars) would be fairly simple (with or
w/o maven), but that's not the goal.  The goal is to make it easy to build
both at once, have patches that update both, and (make it easy to) have
atomic svn commits that touch both.


-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


   




[jira] Created: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-17 Thread Michael Busch (JIRA)
Use parallel arrays instead of PostingList objects
--

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.

In order to avoid having very many long-living PostingList objects in 
TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
simply be an int[] which maps each term to dense termIDs.

All data that the PostingList classes currently hold will then be placed in 
parallel arrays, where the termID is the index into the arrays.  This will 
avoid the need for object pooling and will remove the overhead of object 
initialization and garbage collection.  Especially garbage collection should 
benefit significantly when the JVM runs out of memory, because in such a 
situation the gc mark times can get very long if there is a big number of 
long-living objects in memory.
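
To make the parallel-array idea concrete, here is a minimal sketch. It is not the actual TermsHashPerField code: the class name ParallelPostings and the fields textStart, docFreq and lastDocID are assumptions chosen for illustration. It only shows the general technique of replacing one object per term with one slot per termID in several int[] arrays.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's actual indexing code.
class ParallelPostings {
    // One slot per termID in each array, instead of one object per term.
    int[] textStart;   // assumed field: offset of the term's text
    int[] docFreq;     // assumed field: number of documents containing the term
    int[] lastDocID;   // assumed field: last document the term occurred in
    private int size;  // number of termIDs handed out so far

    ParallelPostings(int initialCapacity) {
        textStart = new int[initialCapacity];
        docFreq = new int[initialCapacity];
        lastDocID = new int[initialCapacity];
    }

    // Allocates the next dense termID, growing all arrays together when full.
    int newTerm(int textOffset) {
        if (size == textStart.length) {
            int newCapacity = Math.max(1, size * 2);
            textStart = Arrays.copyOf(textStart, newCapacity);
            docFreq = Arrays.copyOf(docFreq, newCapacity);
            lastDocID = Arrays.copyOf(lastDocID, newCapacity);
        }
        int termID = size++;
        textStart[termID] = textOffset;
        docFreq[termID] = 0;
        lastDocID[termID] = -1;
        return termID;
    }

    // Records one occurrence of termID in docID; each document is counted once.
    void addOccurrence(int termID, int docID) {
        if (lastDocID[termID] != docID) {
            docFreq[termID]++;
            lastDocID[termID] = docID;
        }
    }

    int size() {
        return size;
    }
}
```

With this layout the heap holds three long-lived arrays rather than one long-lived object per term, which is the property the description expects to help gc mark times.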

Another benefit could be to build more efficient TermVectors.  We could avoid 
the need of having to store the term string per document in the TermVector.  
Instead we could just store the segment-wide termIDs.  This would reduce the 
size and also make it easier to implement efficient algorithms that use 
TermVectors, because no term mapping across documents in a segment would be 
necessary.  We can make this improvement in a separate JIRA issue, though.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: #lucene IRC log [was: RE: lucene and solr trunk]

2010-03-17 Thread Ian Holsman

+1

I'd like to see the IRC logs added to things like 
http://search-lucene.com/ and 
http://www.lucidimagination.com/search/?q=IRC&Search=Search 



while it might not be great for decision making.. it is amazing for 
helping debug common problems people have


On 3/17/10 7:10 AM, Chris Hostetter wrote:

: with, "if it didn't happen on the lists, it didn't happen". It's the same as

+1

But as the IRC channel gets used more and more, it would *also* be nice if
there was an archive of the IRC channel so that there is a place to go
look to understand the back story behind an idea once it's synthesized and
posted to the lists/jira.

That's the huge advantage IRC has over informal conversations at
hackathons, apachecon, and meetups -- there can in fact be easily
archivable/parsable/searchable records of the communication.



-Hoss




   




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846752#action_12846752
 ] 

Robert Muir commented on LUCENE-2323:
-

bq. But I don't think we're talking about massive amount of code here right?

And hopefully less code if we can put some of these things together and
start looking at refactoring them a bit!

Until code in contrib reaches a certain degree of maturity, I feel we should 
organize it by functionality. It's easy for the users, and it invites the sort 
of refactoring and cleanup that some of this code needs.


> reorganize contrib modules
> --
>
> Key: LUCENE-2323
> URL: https://issues.apache.org/jira/browse/LUCENE-2323
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>
> it would be nice to reorganize contrib modules, so that they are bundled 
> together by functionality.
> For example:
> * the wikipedia contrib is a tokenizer, i think really belongs in 
> contrib/analyzers
> * there are two highlighters, i think could be one highlighters package.
> * there are many queryparsers and queries in different places in contrib




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846744#action_12846744
 ] 

Shai Erera commented on LUCENE-2323:


Robert - I think that's exactly what I was proposing. You indicated that there 
are some components under contrib that you cannot move around because their 
package names would change, and I said that I don't think their package names 
should change :). So you can move XMLQP under contrib/queryparsers and its 
package name will stay the same ...

I also think that at least for analyzers, having all of them under one contrib 
jar will allow us to improve the way our users interact w/ analyzers. Consider 
for example an AnalyzerFactory which when receiving a Locale returns a 
pre-configured Analyzer, the best one we can think of for that Locale. That's 
(to me) a great service to our users, and I don't see how we can do that if all 
analyzers are under different modules. Besides, analyzers are logically close 
to each other because they perform a very specific task. Refactoring the 
analysis API again would be easier if all of them were under the same root 
directory ... less chance of missing some.
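
A rough sketch of the AnalyzerFactory idea follows. Nothing here is an existing Lucene API: the register and forLocale methods are assumed names, and the type parameter A stands in for Lucene's Analyzer so that the example compiles without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; A stands in for Lucene's Analyzer type.
class AnalyzerFactory<A> {
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();
    private final Supplier<A> fallback;

    AnalyzerFactory(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    // Registers the preferred analyzer for a language code such as "de".
    void register(String language, Supplier<A> supplier) {
        byLanguage.put(language, supplier);
    }

    // Returns a fresh analyzer for the locale's language, or the fallback.
    A forLocale(Locale locale) {
        return byLanguage.getOrDefault(locale.getLanguage(), fallback).get();
    }
}
```

A factory like this needs every registered analyzer on the classpath, which is easier to guarantee when they all ship in one contrib jar.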

Query parsers are different because I agree a user will likely pick one for his 
app and go with it. But I don't think we're talking about massive amount of 
code here right? So again it makes sense to bundle them up together. We can 
have a module-level documentation of the different query parsers, pros and cons 
for each, use cases etc., and then the user can pick. If jar size is important 
to someone, then I think that someone is already recompiling everything to 
include just what he needs, and so we're not hurting anyone here.

Therefore I see this reorg as a logical and important step towards 
modularization.




[jira] Reopened: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-17 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reopened LUCENE-2326:
-


This use of svn:externals causes a problem for snowball: it does not always 
fetch the correct revision.

{quote}
[junit] Testcase: 
testStemmers(org.apache.lucene.analysis.snowball.TestSnowballVocab): FAILED
[junit] term 0 expected: but was:
[junit] junit.framework.ComparisonFailure: term 0 expected: 
but was:
{quote}

This is sporadic and not easy to reproduce.

You can clearly see that it's fetching the wrong revision by looking at:
http://svn.tartarus.org/snowball/trunk/data/german/output.txt?r1=432&r2=527

Where rev 527 expects "amtsgeheimnis", but rev 500 should expect 
"amtsgeheimniss"


> Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
> branch and linking snowball tests by svn:externals
> ---
>
> Key: LUCENE-2326
> URL: https://issues.apache.org/jira/browse/LUCENE-2326
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch, 3.1
>
> Attachments: LUCENE-2326.patch, LUCENE-2326.patch
>
>
> As we often need to update backwards tests together with trunk and always 
> have to update the branch first, record rev no, and update build xml, I would 
> simply like to do a svn copy/move of the backwards branch.
> After a release, this is simply also done:
> {code}
> svn rm backwards
> svn cp releasebranch backwards
> {code}
> By this we can simply commit in one pass, create patches in one pass.
> The snowball tests are currently downloaded by svn.exe, too. These need a 
> fixed version for checkout. I would like to change this to use svn:externals. 
> Will provide patch, soon.




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846719#action_12846719
 ] 

Robert Muir commented on LUCENE-2323:
-

bq. it could be that I thought it was a really great idea at the time

my problem is not with the idea, but that it's just unrealistic.

more realistic, let's say for the queryparsers, would be, for example:
# moving the queryparsers together as i proposed here
# implementing some of the specialized ones with the new flexible queryparser 
(LUCENE-1823)
# removing the now obsolete specialized queryparsers.
# improving tests and general quality of the queryparsing package

At this point the code might be mature enough for an idea like yours.

I'm also realistic, and I know I probably cannot do much here except step 1, as 
I'm not a queryparser expert.

But I can say there's at least a patch open for step 2, even if this patch 
might not yet be ready.
So this seems like a realistic small step forward towards improving things.

The modularization idea won't clean up contrib... some of it is hairy and that 
needs to be done first.





[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846712#action_12846712
 ] 

Hoss Man commented on LUCENE-2323:
--

bq. I didn't know this was the goal, if what you say is true, then I must say i 
completely misunderstood, I completely disagree, and I'm completely off-base 
with this issue.

I'm not saying it is a goal, or should be a goal, just that i seem to remember 
that this was the direction that seemed to have support the last time i 
remember there being a big "reorg the contribs" discussion.  (i could be 
remembering wrong, it could be that *I* thought it was a really great idea at 
the time so it stuck with me, and now i'm just more ambivalent)  

A quick skim suggests this is the most recent thread i'm thinking of...

http://old.nabble.com/New-flexible-query-parser-to22549684.html#a22637326
("kitchen sink" was the search term i was looking for)

...but i don't think that was the first time it came up.




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846711#action_12846711
 ] 

Mark Miller commented on LUCENE-2323:
-

This reorg is a great step for contrib IMO!

+1




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846709#action_12846709
 ] 

Robert Muir commented on LUCENE-2323:
-

{quote}
agreed ... IIRC the idea in this discussion was to have a lot more smaller 
"modules", with a lot better defined/advertised dependencies, so that module 
X,Y,Z might all depend on modules A, and B (which had the common refactored 
code you speak of)
{quote}

I didn't know this was the goal, if what you say is true, then I must say i 
completely
misunderstood, I completely disagree, and I'm completely off-base with this 
issue.





[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846707#action_12846707
 ] 

Hoss Man commented on LUCENE-2323:
--

bq. Perhaps I want to refactor some code among our 7 queryparsers or 2 
highlighters or whatever, the only way I can do this is to shove stuff (shared 
code) into core, I think this is bad.

agreed ... IIRC the idea in this discussion was to have a lot more smaller 
"modules", with a lot better defined/advertised dependencies, so that module 
X,Y,Z might all depend on modules A, and B (which had the common refactored 
code you speak of) and the "core" module is special in that it must never 
depend on anything else.

Like i said: I personally don't have a very strong opinion about this, i think 
people who are really concerned about jar sizes can compile their own after 
pruning the classes they don't care about -- but it's definitely harder when 
those classes are all in one atomic source tree where you might not notice that 
someone refactored a common dependency that wasn't there before.





[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846702#action_12846702
 ] 

Robert Muir commented on LUCENE-2323:
-

Hoss Man, the only problem I have with what you said, is that it prevents 
factoring the code.

Perhaps I want to refactor some code among our 7 queryparsers or 2 highlighters 
or whatever, 
the only way I can do this is to shove stuff (shared code) into core, I think 
this is bad.

Otherwise, I don't really care how things are packaged, this proposal was 
supposed to be
a small step towards modules.






[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846700#action_12846700
 ] 

Hoss Man commented on LUCENE-2323:
--

I personally don't have a strong opinion on this, but i wanted to point it out 
for completeness:

the last time i remember a big discussion about reorging contribs, there seemed 
to be a strong sentiment that we should be striving for more "small" 
contribs/modules -- specifically in terms of artifact size/complexity.  I think 
one specific example was that some people might want a few language specific 
analyzers, but not all of them -- and if they have no direct dependencies on 
each other (just core) we should try to build/distribute them as (tiny) 
individual Jars -- and possibly in (big) bundled jars as well.

So while it might make a lot of sense to organize some existing contribs into 
logical "groups" which might get built up in big bundled jars, there are likely 
going to be people who still want to consume the existing jars (or even more 
granular jars)

Looking at the specific suggestions robert made: it makes sense to logically 
organize all the query parsers under a common directory, but how many users are 
actually using more than one and are we doing them a disservice if we only ship 
them in one big jar?   Ditto for the highlighters (does anyone besides Solr use 
*both* highlighters in a single application?)




Re: lucene and solr trunk

2010-03-17 Thread Chris Hostetter
: build and nicely gets all dependencies to Lucene and Tika whenever I build
: or release, no problem there and certainly no need to have it merged into
: Lucene's svn!

The key distinction is that Solr is already in "Lucene's svn" -- The 
question is how to reorg things in a way that makes it easier to build Solr 
and Lucene-Java all at once, while still making it easy to build just 
Lucene-Java.

: Professionally i work on a (world-class) geocoder that also nicely depends
: on Lucene by using maven, no problems there at all and no need to merge
: that code in Lucene's svn!

Unless maven has some features i'm not aware of, your "nicely depends" 
works by pulling Lucene jars from a repository -- changing Solr to do 
that (instead of having committed jars) would be fairly simple (with or 
w/o maven), but that's not the goal.  The goal is to make it easy to build 
both at once, have patches that update both, and (make it easy to) have 
atomic svn commits that touch both.


-Hoss





[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846665#action_12846665
 ] 

Michael McCandless commented on LUCENE-2320:


Shai this patch looks good -- thanks!  Somehow you keep getting yourself sucked 
into the issues that need big patches to fix

> Add MergePolicy to IndexWriterConfig
> 
>
> Key: LUCENE-2320
> URL: https://issues.apache.org/jira/browse/LUCENE-2320
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
> LUCENE-2320.patch
>
>
> Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
> well. The change is not straightforward and so I've kept it for a separate 
> issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
> passed to it before an IndexWriter actually exists. And today IW may create 
> an MP just for it to be overridden by the application one line afterwards. I 
> don't want to make iw member of MP non-final, or settable by extending 
> classes, however it needs to remain protected so they can access it directly. 
> So the proposed changes are:
> * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
> once (hence its name). It'll have the signature SetOnce w/ *synchronized 
> set* and *T get()*. T will be declared volatile, so that get() won't be 
> synchronized.
> * MP will define a *protected final SetOnce writer* instead of 
> the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
> * MP will offer a public default ctor, together with a set(IndexWriter).
> * IndexWriter will set itself on MP using set(this). Note that if set will be 
> called more than once, it will throw an exception (AlreadySetException - or 
> does someone have a better suggestion, preferably an already existing Java 
> exception?).
> That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
> review and proposals.
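
The SetOnce proposal quoted above could look roughly like the sketch below. It follows the issue text (synchronized set, volatile value, unsynchronized get, AlreadySetException); the method bodies and the choice to extend IllegalStateException are assumptions, not the attached patch.

```java
// Hypothetical sketch of the proposal; names follow the issue text, bodies are assumed.
final class SetOnce<T> {
    // Assumption: extending an existing JDK exception, as the issue asks about.
    static final class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice!");
        }
    }

    private volatile T value; // volatile, so get() needs no lock
    private boolean set;      // only read and written inside synchronized set()

    // Stores the value on the first call and throws on every later call.
    synchronized void set(T newValue) {
        if (set) {
            throw new AlreadySetException();
        }
        value = newValue;
        set = true;
    }

    // Returns the stored value, or null if set() has not been called yet.
    T get() {
        return value;
    }
}
```

Under this sketch, a MergePolicy would hold a final SetOnce field for its IndexWriter, and IndexWriter would call set(this) exactly once; a second call fails fast instead of silently rebinding the writer.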




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846622#action_12846622
 ] 

Mark Miller commented on LUCENE-2320:
-

+1 - I've had to do this in the past too. Just dropping tests doesn't seem like 
the way to go in many cases.




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846621#action_12846621
 ] 

Michael McCandless commented on LUCENE-2320:


I think it's OK to add stubs to src/* under backwards branch, in cases like 
this?  Ie when an experimental API is changed.

Just removing the tests that use the affected API isn't really an option here 
-- eg some tests explicitly set up a LogDocMergePolicy (not sure exactly why) 
and we in general can't just remove that.

> Add MergePolicy to IndexWriterConfig
> 
>
> Key: LUCENE-2320
> URL: https://issues.apache.org/jira/browse/LUCENE-2320
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
> LUCENE-2320.patch
>
>
> Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
> well. The change is not straightforward and so I've kept it for a separate 
> issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
> passed to it before an IndexWriter actually exists. And today IW may create 
> an MP just for it to be overridden by the application one line afterwards. I 
> don't want to make iw member of MP non-final, or settable by extending 
> classes, however it needs to remain protected so they can access it directly. 
> So the proposed changes are:
> * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
> once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
> set(T)* and *T get()*. The T field will be declared volatile, so that get() 
> won't need to be synchronized.
> * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
> the current writer. *NOTE: this is a bw break*. Any suggestions are welcome.
> * MP will offer a public default ctor, together with a set(IndexWriter).
> * IndexWriter will set itself on MP using set(this). Note that if set is 
> called more than once, it will throw an exception (AlreadySetException - or 
> does someone have a better suggestion, preferably an already existing Java 
> exception?).
> That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
> review and proposals.
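
For readers following along, here is a minimal sketch of the SetOnce idea as the 
issue description words it: a synchronized set, an unsynchronized get over a 
volatile field, and an exception on the second set. It is not taken from any of 
the attached patches. PolicySketch and SetOnceDemo are hypothetical names, a 
String stands in for IndexWriter, and AlreadySetException extending 
IllegalStateException is an assumption (the description leaves the exception 
type open).

```java
// Sketch only: follows the wording of the issue description, not the patch.

/** Thrown on a second set(); the name is the one floated in the proposal. */
class AlreadySetException extends IllegalStateException {
  AlreadySetException() {
    super("The object cannot be set twice!");
  }
}

/** Holds a value that may be assigned at most once. */
final class SetOnce<T> {
  private volatile T value;  // volatile: get() sees the write without locking
  private boolean isSet;     // only read and written inside synchronized set()

  synchronized void set(T newValue) {
    if (isSet) {
      throw new AlreadySetException();
    }
    value = newValue;
    isSet = true;
  }

  T get() {
    return value;
  }
}

/** Hypothetical stand-in for MergePolicy; String stands in for IndexWriter. */
class PolicySketch {
  protected final SetOnce<String> writer = new SetOnce<String>();

  void setIndexWriter(String w) {
    writer.set(w);
  }
}

class SetOnceDemo {
  public static void main(String[] args) {
    PolicySketch mp = new PolicySketch();
    mp.setIndexWriter("writer-1");
    System.out.println(mp.writer.get());
    try {
      mp.setIndexWriter("writer-2");
    } catch (AlreadySetException e) {
      System.out.println("second set rejected");
    }
  }
}
```

The field itself stays final, so subclasses can read it directly but cannot 
swap it out; only the value inside the holder is assigned late.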




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846620#action_12846620
 ] 

Robert Muir commented on LUCENE-2323:
-

bq. I do want to propose to omit the component name from the package (where it 
makes sense).

Shai, I agree it's a little redundant, yet under this issue I wanted to avoid 
changing package names as much as possible: changing the package name breaks 
people's code, and that's why I wanted to just do the first part, with no 
package naming changes.

I thought these initial svn moves are obvious wins, and any further stuff can 
be done later under another issue...

Typically I would prefer to do a full reorganization at once, but in my opinion 
that is for a later, probably longer and more frustrating discussion, and it 
needs to involve Solr, too.


> reorganize contrib modules
> --
>
> Key: LUCENE-2323
> URL: https://issues.apache.org/jira/browse/LUCENE-2323
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Reporter: Robert Muir
>
> it would be nice to reorganize contrib modules, so that they are bundled 
> together by functionality.
> For example:
> * the wikipedia contrib is a tokenizer, i think really belongs in 
> contrib/analyzers
> * there are two highlighters, i think could be one highlighters package.
> * there are many queryparsers and queries in different places in contrib




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846619#action_12846619
 ] 

Uwe Schindler commented on LUCENE-2320:
---

In that case just remove the test in backwards. If you just replicate it the 
same way as in trunk, it does not make sense.




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846615#action_12846615
 ] 

Shai Erera commented on LUCENE-2320:


Uwe, I'm pretty familiar w/ how backwards goes .. I've had a lot of bw breaks 
in my contribution history :). This patch + issue removes the MP ctor which 
accepts IW and exposes the default ctor only. That's a bw break, which is 
documented in CHANGES and was agreed on here, because MP is experimental and 
that gives us the freedom to do so (not that it's such a drastic change). 
Therefore I had to update the src/java section of bw, so that its tests would 
compile against MPs that expose the default ctor, and not the one accepting IW.




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846614#action_12846614
 ] 

Uwe Schindler commented on LUCENE-2320:
---

It's normally not the idea of backwards tests to change the src/java part of 
backwards, as this would hide a backwards break. You should only change 
src/tests in backwards! src/java is only for compiling a JAR file of the old 
Lucene version; if you change it, you test against the wrong classes!

Uwe




[jira] Updated: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2320:
---

Attachment: LUCENE-2320.patch

Sorry ... I generated the patch on the wrong backwards folder (the one before 
Uwe's changes) :). I hope this time it's ok ...




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Kay Kay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846606#action_12846606
 ] 

Kay Kay commented on LUCENE-2323:
-

When we talk about reorganization, it would be useful to run some JDepend 
reports (http://clarkware.com/software/JDepend.html) as a metric for the 
stability of the packages.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-17 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846604#action_12846604
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Previously there was a discussion about DW index readers that
stay open, but could refer to byte arrays that are
recycled? Can't we simply throw away the doc writer after a
successful segment flush (the IRs would refer to it, however
once they're closed, the DW would close as well)? Then start
with a new DW for the next batch of indexing for that thread?

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846602#action_12846602
 ] 

Shai Erera commented on LUCENE-2323:


Robert - I think that's a great reorganization.

I do want to propose to omit the component name from the package (where it 
makes sense). I.e., unless we want to have o.a.l.qp, the QPs under 
contrib/queryParser don't all need to belong to the same package. If it makes 
sense for all of them to belong to o.a.l.search (as an example), then that's 
where they should go, IMO.

So I think it's ok if under contrib/queryparser or contrib/analyzers you'll see 
packages like o.a.l.analysis.ar/fr/snowball, and I'm equally fine if all of 
them exist under o.a.l.analysis. Analysis makes sense as a package name.




[jira] Updated: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2320:
---

Attachment: LUCENE-2320.patch

Updating to the latest revision. This should be ok now.




[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-17 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: lucene-2324-no-pooling.patch

All tests pass, but I still have to review whether the memory consumption 
calculation works correctly with these changes. Not sure if the JUnit tests 
cover that?

I also haven't done any performance testing yet.

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324-no-pooling.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
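
To make the summary above concrete, here is a toy sketch of the "N fully 
independent RAM segments" idea: each indexing thread is bound to one segment, 
adds documents only to it, and flushes it without touching the others. This 
only illustrates the shape of the proposal, not code from the attached patch. 
RamSegmentSketch, SegmentBinderSketch and the integer slot used to pick a 
segment are all made-up names and assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-RAM segment: buffers docs privately, flushes on its own. */
class RamSegmentSketch {
  private final List<String> buffered = new ArrayList<String>();
  private int flushedDocs;

  synchronized void addDocument(String doc) {
    buffered.add(doc);
  }

  /** Flushes only this segment and returns how many docs were written. */
  synchronized int flush() {
    int n = buffered.size();
    buffered.clear();
    flushedDocs += n;
    return n;
  }

  synchronized int flushedDocs() {
    return flushedDocs;
  }
}

/** Hypothetical binder: maps an indexing slot to one of N segments. */
class SegmentBinderSketch {
  private final RamSegmentSketch[] segments;

  SegmentBinderSketch(int n) {
    segments = new RamSegmentSketch[n];
    for (int i = 0; i < n; i++) {
      segments[i] = new RamSegmentSketch();
    }
  }

  RamSegmentSketch segmentFor(int slot) {
    return segments[Math.floorMod(slot, segments.length)];
  }

  int totalFlushed() {
    int total = 0;
    for (RamSegmentSketch s : segments) {
      total += s.flushedDocs();
    }
    return total;
  }
}

class PerThreadSegmentsDemo {
  public static void main(String[] args) throws InterruptedException {
    final SegmentBinderSketch binder = new SegmentBinderSketch(4);
    Thread[] threads = new Thread[4];
    for (int i = 0; i < threads.length; i++) {
      final int slot = i;
      threads[i] = new Thread(new Runnable() {
        public void run() {
          RamSegmentSketch seg = binder.segmentFor(slot);
          for (int d = 0; d < 100; d++) {
            seg.addDocument("doc-" + slot + "-" + d);
          }
          seg.flush();  // no other thread's segment is locked here
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("docs flushed: " + binder.totalFlushed());
  }
}
```

The point of the sketch is that the only lock a thread takes is on its own 
segment, which is what would let flushes overlap on IO and CPU.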




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-17 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846586#action_12846586
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. Michael, Agreed, can you outline how you think we should proceed then?

Sorry for not responding earlier...

I'm currently working on removing the PostingList object pooling, because it 
makes TermsHash and TermsHashPerThread much easier.  I have written the patch 
and all tests pass, though I haven't done performance testing yet.  Making 
TermsHash and TermsHashPerThread smaller will also simplify the patch here, 
which will remove them. I'll post the patch soon. 

I think the next steps here are to make everything downstream of 
DocumentsWriter single-threaded (removal of the *PerThread classes).  Then we 
need to write the DocumentsWriterThreadBinder and have to think about how to 
apply deletes, commits and rollbacks to all DocumentsWriter instances.  




[jira] Updated: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2320:
---

Attachment: LUCENE-2320.patch

Attached a patch that removes the IW-related ctors from MPs and fixes 
backwards. All tests pass, including javadocs.




Re: IndexWriter.synced field accumulates data

2010-03-17 Thread Gregor Kaczor
followup in

https://issues.apache.org/jira/browse/LUCENE-2328


Original Message
> Date: Wed, 17 Mar 2010 14:30:25 -0500
> From: Michael McCandless 
> To: java-dev@lucene.apache.org
> Subject: Re: IndexWriter.synced field accumulates data

> You're right!
> 
> Really we should delete from sync'd when we delete the files.  We need
> to tie into IndexFileDeleter for that, maybe moving this set into
> there.
> 
> Though in practice the amount of actual RAM used should rarely be an
> issue?  But we should fix it...
> 
> Can you open an issue?
> 
> Mike
> 
> On Wed, Mar 17, 2010 at 1:15 PM, Gregor Kaczor wrote:
> > I am running into a strange OutOfMemoryError. My small test application
> > does index and delete some few files. This is repeated for 60k times.
> > Optimization is run from every 2k times a file is indexed. Index size is
> > 50KB. I did analyze the HeapDumpFile and realized that IndexWriter.synced
> > field occupied more than half of the heap. That field is a private HashSet
> > without a getter. Its task is to hold files which have been synced already.
> >
> > There are two calls to addAll and one call to add on synced but no
> > remove or clear throughout the lifecycle of the IndexWriter instance.
> >
> > According to the Eclipse Memory Analyzer synced contains 32618 entries
> > which look like file names "_e065_1.del" or "_e067.cfs"
> >
> > The index directory contains 10 files only.
> >
> > I guess synced is holding obsolete data



[jira] Created: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-17 Thread Gregor Kaczor (JIRA)
IndexWriter.synced field accumulates data leading to a Memory Leak
---

Key: LUCENE-2328
URL: https://issues.apache.org/jira/browse/LUCENE-2328
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 3.0.1, 3.0, 2.9.2, 2.9.1
Environment: all
Reporter: Gregor Kaczor
Priority: Minor


I am running into a strange OutOfMemoryError. My small test application indexes
and then deletes a few files, and repeats this 60k times. Optimization is run
after every 2k indexed files. Index size is 50KB. I analyzed the heap dump file
and found that the IndexWriter.synced field occupied more than half of the
heap. That field is a private HashSet without a getter. Its task is to hold the
names of files which have already been synced.

There are two calls to addAll and one call to add on synced, but no remove or
clear throughout the lifecycle of the IndexWriter instance.

According to the Eclipse Memory Analyzer, synced contains 32618 entries which
look like file names, e.g. "_e065_1.del" or "_e067.cfs".

The index directory contains only 10 files.

I guess synced is holding obsolete data.
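
A toy model of the pattern described here may help, under the assumption that 
the set is only ever added to. It also shows the remedy suggested on the list, 
which is to drop names from the set when the files are deleted. This is not 
IndexWriter's code: SyncedFilesSketch, markSynced and onFilesDeleted are 
hypothetical names, and the generated file names are made up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the bookkeeping; not IndexWriter's real implementation. */
class SyncedFilesSketch {
  private final Set<String> synced = new HashSet<String>();

  /** Remembers that these files have been synced (mirrors add/addAll). */
  void markSynced(Collection<String> fileNames) {
    synced.addAll(fileNames);
  }

  /** The step the report finds missing: forget names of deleted files. */
  void onFilesDeleted(Collection<String> fileNames) {
    synced.removeAll(fileNames);
  }

  int size() {
    return synced.size();
  }
}

class SyncedLeakDemo {
  public static void main(String[] args) {
    SyncedFilesSketch leaky = new SyncedFilesSketch();
    SyncedFilesSketch pruned = new SyncedFilesSketch();
    for (int i = 0; i < 60000; i++) {
      // Each cycle creates a new, uniquely named file and then deletes it.
      Collection<String> files =
          Arrays.asList("_" + Integer.toString(i, 36) + ".cfs");
      leaky.markSynced(files);
      pruned.markSynced(files);
      pruned.onFilesDeleted(files);
    }
    System.out.println("entries without pruning: " + leaky.size());
    System.out.println("entries with pruning: " + pruned.size());
  }
}
```

In the unpruned variant the set grows with the number of files ever written 
rather than with the number of files currently in the directory, which matches 
the mismatch reported above (32618 entries versus 10 files).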




Re: IndexWriter.synced field accumulates data

2010-03-17 Thread Gregor Kaczor
I will open an issue.

Actually it's not the size of the occupied RAM; the leak itself is the problem.

 Original-Nachricht 
> Datum: Wed, 17 Mar 2010 14:30:25 -0500
> Von: Michael McCandless 
> An: java-dev@lucene.apache.org
> Betreff: Re: IndexWriter.synced field accumulates data

> You're right!
> 
> Really we should delete from sync'd when we delete the files.  We need
> to tie into IndexFileDeleter for that, maybe moving this set into
> there.
> 
> Though in practice the amount of actual RAM used should rarely be an
> issue?  But we should fix it...
> 
> Can you open an issue?
> 
> Mike
> 
> On Wed, Mar 17, 2010 at 1:15 PM, Gregor Kaczor  wrote:
> > I am running into a strange OutOfMemoryError. My small test application
> > does index and delete some few files. This is repeated for 60k times.
> > Optimization is run from every 2k times a file is indexed. Index size is
> > 50KB. I did analyze the HeapDumpFile and realized that IndexWriter.synced
> > field occupied more than half of the heap. That field is a private HashSet
> > without a getter. Its task is to hold files which have been synced already.
> >
> > There are two calls to addAll and one call to add on synced but no
> > remove or clear throughout the lifecycle of the IndexWriter instance.
> >
> > According to the Eclipse Memory Analyzer synced contains 32618 entries
> > which look like file names "_e065_1.del" or "_e067.cfs"
> >
> > The index directory contains 10 files only.
> >
> > I guess synced is holding obsolete data
> >



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846546#action_12846546
 ] 

Michael McCandless commented on LUCENE-2312:


intUptoStart is used in THPF.writeByte which is very much a hotspot when 
indexing, so I added it as a direct member in THPF to avoid an extra deref 
through the intPool.  Could be this is harmless in practice though...

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




Re: IndexWriter.synced field accumulates data

2010-03-17 Thread Michael McCandless
You're right!

Really we should delete from sync'd when we delete the files.  We need
to tie into IndexFileDeleter for that, maybe moving this set into
there.

Though in practice the amount of actual RAM used should rarely be an
issue?  But we should fix it...
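
To make the suggested fix concrete, here is a minimal sketch of the idea: whoever deletes index files also removes them from the synced set. The class and method names (SyncedFileTracker, markSynced, onFilesDeleted) are made up for illustration and are not IndexWriter's or IndexFileDeleter's actual API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the bookkeeping discussed above; not Lucene's code.
class SyncedFileTracker {
    // Names of files that have already been synced to stable storage.
    private final Set<String> synced = new HashSet<String>();

    // Record files once they have been synced (the addAll/add calls in the report).
    synchronized void markSynced(Collection<String> fileNames) {
        synced.addAll(fileNames);
    }

    // Assumed hook: called by whatever component deletes index files,
    // so that names of deleted files do not accumulate forever.
    synchronized void onFilesDeleted(Collection<String> fileNames) {
        synced.removeAll(fileNames);
    }

    synchronized int size() {
        return synced.size();
    }

    public static void main(String[] args) {
        SyncedFileTracker tracker = new SyncedFileTracker();
        // Simulate many index/delete cycles, each creating and removing one file.
        for (int i = 0; i < 1000; i++) {
            String name = "_" + Integer.toString(i, 36) + ".cfs";
            tracker.markSynced(Arrays.asList(name));
            tracker.onFilesDeleted(Arrays.asList(name));
        }
        System.out.println("tracked entries: " + tracker.size());
    }
}
```

Without the onFilesDeleted call the set would hold one entry per file ever created, which matches the growth seen in the heap dump.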

Can you open an issue?

Mike

On Wed, Mar 17, 2010 at 1:15 PM, Gregor Kaczor  wrote:
> I am running into a strange OutOfMemoryError. My small test application does 
> index and delete some few files. This is repeated for 60k times.  
> Optimization is run from every 2k times a file is indexed. Index size is 
> 50KB. I did analyze the HeapDumpFile and realized that IndexWriter.synced 
> field occupied more than half of the heap. That field is a private HashSet 
> without a getter. Its task is to hold files which have been synced already.
>
> There are two calls to addAll and one call to add on synced but no remove or 
> clear throughout the lifecycle of the IndexWriter instance.
>
> According to the Eclipse Memory Analyzer synced contains 32618 entries which 
> look like file names "_e065_1.del" or "_e067.cfs"
>
> The index directory contains 10 files only.
>
> I guess synced is holding obsolete data
>



[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846543#action_12846543
 ] 

Michael McCandless commented on LUCENE-2323:


bq. Here are my initial thoughts on this. 

+1, I think this initial re-org is great Robert!

I think it'd be OK to rename XML QP and Wikipedia as well.  Surround does seem 
trickier... maybe leave that for now.

> reorganize contrib modules
> --
>
> Key: LUCENE-2323
> URL: https://issues.apache.org/jira/browse/LUCENE-2323
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>
> it would be nice to reorganize contrib modules, so that they are bundled 
> together by functionality.
> For example:
> * the wikipedia contrib is a tokenizer, i think really belongs in 
> contrib/analyzers
> * there are two highlighters, i think could be one highlighters package.
> * there are many queryparsers and queries in different places in contrib




[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846542#action_12846542
 ] 

Mark Miller commented on LUCENE-2305:
-

Ah, yes - I didn't remember your comment correctly:

{quote}
We could make the change under Version?  (Change to true, starting in 3.1).

Or maybe not make the change.  If set to true, we use pct deletion on
a segment to reduce its perceived size when selecting merges, which
generally causes segments with pending deletions to be merged away
sooner
{quote}

Sounds like a good move.
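
For readers following along, the "pct deletion" idea in the quote can be sketched roughly as below. This is only an illustration under assumed names (CalibratedSize, calibratedSize), not the merge policy's actual code: a segment's byte size is scaled by its fraction of live documents, so a segment with many deletes looks smaller to merge selection.

```java
// Hypothetical helper illustrating size calibration by deletes.
class CalibratedSize {
    // Returns the size the merge policy would "perceive" for a segment.
    // When the flag is off (or the segment is empty) the raw size is used.
    static long calibratedSize(long sizeInBytes, int docCount, int delCount,
                               boolean calibrateSizeByDeletes) {
        if (!calibrateSizeByDeletes || docCount <= 0) {
            return sizeInBytes;
        }
        // Fraction of documents in the segment that are marked deleted.
        double delRatio = (double) delCount / docCount;
        return (long) (sizeInBytes * (1.0 - delRatio));
    }

    public static void main(String[] args) {
        // A 1000-byte segment with 25 of 100 docs deleted.
        System.out.println(calibratedSize(1000L, 100, 25, true));
        System.out.println(calibratedSize(1000L, 100, 25, false));
    }
}
```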

> Introduce Version in more places long before 4.0
> 
>
> Key: LUCENE-2305
> URL: https://issues.apache.org/jira/browse/LUCENE-2305
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Shai Erera
> Fix For: 3.1
>
>
> We need to introduce Version in as many places as we can (wherever it makes 
> sense of course), and preferably long before 4.0 (or shall I say 3.9?) is 
> out. That way, we can have a bunch of deprecated API now, that will be gone 
> in 4.0, rather than doing it one class at a time and never finish :).
> The purpose is to introduce Version wherever it is mandatory now, and also in 
> places where we think it might be useful in the future (like most of our 
> Analyzers, configured classes and configuration classes).
> I marked this issue for 3.1, though I don't expect it to end in 3.1. I still 
> think it will be done one step at a time, perhaps for cluster of classes 
> together. But on the other hand I don't want to mark it for 4.0.0 because 
> that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd 
> mark it for 3.9. We can do several commits in one issue right? So this one 
> can live for a while in JIRA, while we gradually convert more and more 
> classes.
> The first candidate is InstantiatedIndexWriter which probably should take an 
> IndexWriterConfig. While I converted the code to use IWC, I've noticed 
> Instantiated defaults its maxFieldLength to the current default (10,000) 
> which is deprecated. I couldn't change it for back-compat reasons. But we can 
> upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 
> 3.1, otherwise stay w/ the deprecated default.
> if it's acceptable to have several commits in one issue, I can start w/ 
> Instantiated, post a patch and then we can continue to more classes.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-17 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846540#action_12846540
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

I think the DW index reader needs to create a new fields reader on demand if 
the field infos have changed since the last field reader instantiation.
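
A rough sketch of what "create a new fields reader on demand" could look like, assuming the field infos carry a change counter. All names here (RamFieldInfos, RamFieldsReader, RamBufferReader) are hypothetical and do not correspond to classes in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutable field list with a change counter.
class RamFieldInfos {
    private final List<String> names = new ArrayList<String>();
    private long version;

    synchronized void add(String name) {
        names.add(name);
        version++;
    }

    synchronized long version() {
        return version;
    }

    synchronized List<String> snapshot() {
        return new ArrayList<String>(names);
    }
}

// Hypothetical reader that sees a fixed snapshot of the fields.
class RamFieldsReader {
    final List<String> fields;
    final long version;

    RamFieldsReader(List<String> fields, long version) {
        this.fields = fields;
        this.version = version;
    }
}

// Hands out a cached fields reader and rebuilds it only when the
// field infos have changed since the reader was created.
class RamBufferReader {
    private final RamFieldInfos infos;
    private RamFieldsReader cached;

    RamBufferReader(RamFieldInfos infos) {
        this.infos = infos;
    }

    RamFieldsReader fieldsReader() {
        // Lock the infos so the version and snapshot are read consistently.
        synchronized (infos) {
            long current = infos.version();
            if (cached == null || cached.version != current) {
                cached = new RamFieldsReader(infos.snapshot(), current);
            }
            return cached;
        }
    }
}
```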

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846537#action_12846537
 ] 

Michael McCandless commented on LUCENE-2305:


Hmm... I think true is likely the better default (since it will tend, more, to 
merge segments that have many deletes)?

I had said leave it as false because we missed this TODO in 3.0.

But... if we add Version to MP (I think that makes sense) then I think we 
should flip the default?

> Introduce Version in more places long before 4.0
> 
>
> Key: LUCENE-2305
> URL: https://issues.apache.org/jira/browse/LUCENE-2305
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Shai Erera
> Fix For: 3.1
>
>
> We need to introduce Version in as many places as we can (wherever it makes 
> sense of course), and preferably long before 4.0 (or shall I say 3.9?) is 
> out. That way, we can have a bunch of deprecated API now, that will be gone 
> in 4.0, rather than doing it one class at a time and never finish :).
> The purpose is to introduce Version wherever it is mandatory now, and also in 
> places where we think it might be useful in the future (like most of our 
> Analyzers, configured classes and configuration classes).
> I marked this issue for 3.1, though I don't expect it to end in 3.1. I still 
> think it will be done one step at a time, perhaps for cluster of classes 
> together. But on the other hand I don't want to mark it for 4.0.0 because 
> that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd 
> mark it for 3.9. We can do several commits in one issue right? So this one 
> can live for a while in JIRA, while we gradually convert more and more 
> classes.
> The first candidate is InstantiatedIndexWriter which probably should take an 
> IndexWriterConfig. While I converted the code to use IWC, I've noticed 
> Instantiated defaults its maxFieldLength to the current default (10,000) 
> which is deprecated. I couldn't change it for back-compat reasons. But we can 
> upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 
> 3.1, otherwise stay w/ the deprecated default.
> if it's acceptable to have several commits in one issue, I can start w/ 
> Instantiated, post a patch and then we can continue to more classes.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-17 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846526#action_12846526
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Mike, can you clarify why intUptos and intUptoStart are member
variables in TermsHashPerField? Can't the accessors simply refer
to IntBlockPool for these? I'm asking because in IntBlockPool flush,
for now I'm simply calling nextBuffer to shuffle the current
writable array into a read only state (ie, all of the arrays
being written to prior to flush will now be readonly).

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846522#action_12846522
 ] 

Robert Muir commented on LUCENE-2323:
-

Here are my initial thoughts on this. 
I don't think we have to do the entire thing at one time either:

* fold contrib/regex into contrib/queries:
** alongside 'similar' you would see 'regex'
* fold the four queryparsers in contrib/misc into contrib/queryparser, so under 
o.a.l.qp you would see:
** core
** standard
** complexPhrase
** ext
** precedence
** analyzing
* fold the fast-vector-highlighter under highlighter, so under o.a.l.search you 
would see:
** highlight
** vectorhighlight

In a second phase, potentially a different issue, i would like to discuss 
issues that might involve backwards breaks:
* xml-query-parser: really belongs in contrib/queryparser, but we would have to 
change pkg names.
* wikipedia: really belongs in analysis, but we would have to change pkg names.
* contrib/surround: what to do? has both queryparser and queries, maybe should 
stay as is.

Any objections to doing the first part, which has no pkg naming changes?


> reorganize contrib modules
> --
>
> Key: LUCENE-2323
> URL: https://issues.apache.org/jira/browse/LUCENE-2323
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>
> it would be nice to reorganize contrib modules, so that they are bundled 
> together by functionality.
> For example:
> * the wikipedia contrib is a tokenizer, i think really belongs in 
> contrib/analyzers
> * there are two highlighters, i think could be one highlighters package.
> * there are many queryparsers and queries in different places in contrib




IndexWriter.synced field accumulates data

2010-03-17 Thread Gregor Kaczor
I am running into a strange OutOfMemoryError. My small test application
indexes and deletes a few files, and this is repeated 60k times. Optimization
runs every 2k times a file is indexed. The index size is 50KB. I analyzed
the heap dump and found that the IndexWriter.synced field occupied more than
half of the heap. That field is a private HashSet without a getter. Its task is
to hold files which have already been synced.

There are two calls to addAll and one call to add on synced, but no remove or
clear anywhere in the lifecycle of the IndexWriter instance.

According to the Eclipse Memory Analyzer, synced contains 32618 entries that
look like file names, such as "_e065_1.del" or "_e067.cfs".

The index directory contains only 10 files.

I guess synced is holding obsolete data.




[jira] Created: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-17 Thread Shane (JIRA)
IndexOutOfBoundsException in FieldInfos.java


 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor


When retrieving the scoreDocs from a multisearcher, the following exception is 
thrown:

java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
at org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)

The error occurs when the fieldNumber passed to FieldInfos.fieldInfo() is 
greater than the size of the array list containing the FieldInfo values.  I am 
not sure what the field number represents or why it would be larger than the 
array list's size.  The quick fix would be to validate the bounds, but there may 
be a bigger underlying problem.  The issue does appear to be directly related to 
LUCENE-939.  I've only been able to reproduce this in my production environment 
and so can't give a good test case.
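
To illustrate the "validate the bounds" quick fix, here is a small sketch on a stand-in class (SketchFieldInfos is hypothetical, not the real FieldInfos). It fails with a descriptive message instead of a bare IndexOutOfBoundsException; it does not address whatever underlying problem produces the bad field number.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a number -> field name lookup table.
class SketchFieldInfos {
    private final List<String> byNumber = new ArrayList<String>();

    void add(String name) {
        byNumber.add(name);
    }

    // Looks up a field name, checking the number against the table size first.
    String fieldName(int fieldNumber) {
        if (fieldNumber < 0 || fieldNumber >= byNumber.size()) {
            throw new IllegalStateException(
                "field number " + fieldNumber + " is out of range (known fields: "
                + byNumber.size() + "); the term data may not match this segment");
        }
        return byNumber.get(fieldNumber);
    }

    public static void main(String[] args) {
        SketchFieldInfos infos = new SketchFieldInfos();
        infos.add("title");
        infos.add("body");
        System.out.println(infos.fieldName(1));
        try {
            infos.fieldName(52);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```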




[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846462#action_12846462
 ] 

Mark Miller commented on LUCENE-2305:
-

Hmm - if I remember right, this is one I brought up before and you said you no 
longer felt it really made sense to default to true Mike?

> Introduce Version in more places long before 4.0
> 
>
> Key: LUCENE-2305
> URL: https://issues.apache.org/jira/browse/LUCENE-2305
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Shai Erera
> Fix For: 3.1
>
>
> We need to introduce Version in as many places as we can (wherever it makes 
> sense of course), and preferably long before 4.0 (or shall I say 3.9?) is 
> out. That way, we can have a bunch of deprecated API now, that will be gone 
> in 4.0, rather than doing it one class at a time and never finish :).
> The purpose is to introduce Version wherever it is mandatory now, and also in 
> places where we think it might be useful in the future (like most of our 
> Analyzers, configured classes and configuration classes).
> I marked this issue for 3.1, though I don't expect it to end in 3.1. I still 
> think it will be done one step at a time, perhaps for cluster of classes 
> together. But on the other hand I don't want to mark it for 4.0.0 because 
> that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd 
> mark it for 3.9. We can do several commits in one issue right? So this one 
> can live for a while in JIRA, while we gradually convert more and more 
> classes.
> The first candidate is InstantiatedIndexWriter which probably should take an 
> IndexWriterConfig. While I converted the code to use IWC, I've noticed 
> Instantiated defaults its maxFieldLength to the current default (10,000) 
> which is deprecated. I couldn't change it for back-compat reasons. But we can 
> upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 
> 3.1, otherwise stay w/ the deprecated default.
> if it's acceptable to have several commits in one issue, I can start w/ 
> Instantiated, post a patch and then we can continue to more classes.




[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846461#action_12846461
 ] 

Michael McCandless commented on LUCENE-2305:


Sigh, yes, adding Version to MP makes sense.

> Introduce Version in more places long before 4.0
> 
>
> Key: LUCENE-2305
> URL: https://issues.apache.org/jira/browse/LUCENE-2305
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Shai Erera
> Fix For: 3.1
>
>
> We need to introduce Version in as many places as we can (wherever it makes 
> sense of course), and preferably long before 4.0 (or shall I say 3.9?) is 
> out. That way, we can have a bunch of deprecated API now, that will be gone 
> in 4.0, rather than doing it one class at a time and never finish :).
> The purpose is to introduce Version wherever it is mandatory now, and also in 
> places where we think it might be useful in the future (like most of our 
> Analyzers, configured classes and configuration classes).
> I marked this issue for 3.1, though I don't expect it to end in 3.1. I still 
> think it will be done one step at a time, perhaps for cluster of classes 
> together. But on the other hand I don't want to mark it for 4.0.0 because 
> that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd 
> mark it for 3.9. We can do several commits in one issue right? So this one 
> can live for a while in JIRA, while we gradually convert more and more 
> classes.
> The first candidate is InstantiatedIndexWriter which probably should take an 
> IndexWriterConfig. While I converted the code to use IWC, I've noticed 
> Instantiated defaults its maxFieldLength to the current default (10,000) 
> which is deprecated. I couldn't change it for back-compat reasons. But we can 
> upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 
> 3.1, otherwise stay w/ the deprecated default.
> if it's acceptable to have several commits in one issue, I can start w/ 
> Instantiated, post a patch and then we can continue to more classes.




[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0

2010-03-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846455#action_12846455
 ] 

Shai Erera commented on LUCENE-2305:


While working on LUCENE-2320, I've noticed these two lines in MP:

{code}
  /* TODO 3.0: change this default to true */
  protected boolean calibrateSizeByDeletes = false;
{code}

Which were left out when we upgraded to 3.0. I guess MP just needs a Version, 
and then we can change that parameter to true if Version is later than 3.1 (or 
when this change is out)?
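
A minimal sketch of what a Version-dependent default might look like, assuming the cut-over is "on or after 3.1". SketchVersion and SketchMergePolicy are placeholder types written for this example, not Lucene's Version or MergePolicy.

```java
// Placeholder version enum; declaration order doubles as release order.
enum SketchVersion {
    V_3_0, V_3_1;

    boolean onOrAfter(SketchVersion other) {
        return compareTo(other) >= 0;
    }
}

// Placeholder merge policy whose default depends on the requested version.
class SketchMergePolicy {
    protected final boolean calibrateSizeByDeletes;

    SketchMergePolicy(SketchVersion matchVersion) {
        // Old behaviour (false) for callers asking for 3.0 compatibility,
        // new behaviour (true) from 3.1 onwards.
        this.calibrateSizeByDeletes = matchVersion.onOrAfter(SketchVersion.V_3_1);
    }

    boolean getCalibrateSizeByDeletes() {
        return calibrateSizeByDeletes;
    }

    public static void main(String[] args) {
        System.out.println(new SketchMergePolicy(SketchVersion.V_3_0).getCalibrateSizeByDeletes());
        System.out.println(new SketchMergePolicy(SketchVersion.V_3_1).getCalibrateSizeByDeletes());
    }
}
```

Applications that pass the older version keep the existing default, so the flip would not silently change merge selection for them.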

> Introduce Version in more places long before 4.0
> 
>
> Key: LUCENE-2305
> URL: https://issues.apache.org/jira/browse/LUCENE-2305
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Shai Erera
> Fix For: 3.1
>
>
> We need to introduce Version in as many places as we can (wherever it makes 
> sense of course), and preferably long before 4.0 (or shall I say 3.9?) is 
> out. That way, we can have a bunch of deprecated API now, that will be gone 
> in 4.0, rather than doing it one class at a time and never finish :).
> The purpose is to introduce Version wherever it is mandatory now, and also in 
> places where we think it might be useful in the future (like most of our 
> Analyzers, configured classes and configuration classes).
> I marked this issue for 3.1, though I don't expect it to end in 3.1. I still 
> think it will be done one step at a time, perhaps for cluster of classes 
> together. But on the other hand I don't want to mark it for 4.0.0 because 
> that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd 
> mark it for 3.9. We can do several commits in one issue right? So this one 
> can live for a while in JIRA, while we gradually convert more and more 
> classes.
> The first candidate is InstantiatedIndexWriter which probably should take an 
> IndexWriterConfig. While I converted the code to use IWC, I've noticed 
> Instantiated defaults its maxFieldLength to the current default (10,000) 
> which is deprecated. I couldn't change it for back-compat reasons. But we can 
> upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 
> 3.1, otherwise stay w/ the deprecated default.
> if it's acceptable to have several commits in one issue, I can start w/ 
> Instantiated, post a patch and then we can continue to more classes.




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846451#action_12846451
 ] 

Michael McCandless commented on LUCENE-2320:


Patch looks good Shai!  I'd rather go with the SetOnce approach than introduce 
a single-use factory for IW to create the MP.

But, I don't think we should keep the MP ctors that take IW?  Ie, you make the 
MP then call .setIW on it?  We can just remove them (and advertise this in the 
CHANGES bw break entry) since it's an experimental API...

> Add MergePolicy to IndexWriterConfig
> 
>
> Key: LUCENE-2320
> URL: https://issues.apache.org/jira/browse/LUCENE-2320
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2320.patch
>
>
> Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
> well. The change is not straightforward and so I've kept it for a separate 
> issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
> passed to it before an IndexWriter actually exists. And today IW may create 
> an MP just for it to be overridden by the application one line afterwards. I 
> don't want to make iw member of MP non-final, or settable by extending 
> classes, however it needs to remain protected so they can access it directly. 
> So the proposed changes are:
> * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
> once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
> set* and *T get()*. T will be declared volatile, so that get() won't be 
> synchronized.
> * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
> the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
> * MP will offer a public default ctor, together with a set(IndexWriter).
> * IndexWriter will set itself on MP using set(this). Note that if set will be 
> called more than once, it will throw an exception (AlreadySetException - or 
> does someone have a better suggestion, preferably an already existing Java 
> exception?).
> That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
> review and proposals.
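
The SetOnce idea described in the issue could be sketched roughly as follows. This is a guess at the shape from the description only (synchronized set, volatile value, unsynchronized get, an exception on the second set); SetOnceSketch and this AlreadySetException are written for illustration and are not the attached patch.

```java
// Thrown when a value is set a second time (name taken from the proposal).
class AlreadySetException extends RuntimeException {
    AlreadySetException() {
        super("The object cannot be set twice!");
    }
}

// Holder whose value can be assigned exactly once.
final class SetOnceSketch<T> {
    // volatile so that get() can read the value without synchronization.
    private volatile T value;
    // Tracks whether set() was called; only touched inside synchronized set().
    private boolean alreadySet;

    synchronized void set(T newValue) {
        if (alreadySet) {
            throw new AlreadySetException();
        }
        value = newValue;
        alreadySet = true;
    }

    T get() {
        return value;
    }

    public static void main(String[] args) {
        SetOnceSketch<String> writer = new SetOnceSketch<String>();
        writer.set("the one and only writer");
        System.out.println(writer.get());
        try {
            writer.set("another writer");
        } catch (AlreadySetException e) {
            System.out.println("second set rejected");
        }
    }
}
```

With something like this, a merge policy could hold a protected final holder for its writer, and the writer would call set(this) once when it takes ownership of the policy.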




Re: Proposed: New logged IRC channel: #lucene_dev

2010-03-17 Thread Shai Erera
I personally prefer that discussions happen on the list/JIRA. It is ok if
some discussions outside these two come in to list/JIRA as a thread/issue.
Such discussions are not limited to only IRC, but also a phone call, email,
ApacheCon etc. However once this has been raised w/ the community, then any
further discussion needs to happen there, for everybody to read/comment.

Emails/JIRA overcome time zone differences. Also if things are being
discussed on both (IRC and JIRA for example), some of the data is lost, or
not coherent with what's being said on the other medium. I don't think that
logging the IRC channel will do any good, because when I wake up in the
morning and check the new emails I've received, I wouldn't know about all
those IRC discussions ... and summary emails are not too good (nor are they
bad; they are better than nothing) IMO.

I believe though that I'm on the minority side on this ... hope I'm wrong.

Shai

On Wed, Mar 17, 2010 at 5:44 PM, Steven A Rowe  wrote:

> As I mentioned in another thread on this list, I'm interested in setting up
> a permanent, linkable-to archive (a.k.a. log) for the lucene IRC channel.
>
> On #lucene, some devs don't want to be logged, and so will not participate
> on a logged IRC channel.  Other devs want logging, to be able to point to
> it for background discussion, and also to provide transparency for those who
> don't participate regularly.
>
> In an attempt to satisfy both camps, it was suggested that a new logged
> channel be set up.  The name #lucene_dev was suggested, in part because this
> name indicates separating user questions, which don't really need to be
> logged, from development, in the same way that the Lucene mailing lists are
> now separated.
>
> Note that this is not an official proposal/vote or anything like that.  I'm
> just looking for comments that would change or stop this proposal from being
> implemented.
>
> I should say that when I asked about logging #lucene, I didn't anticipate
> that it would be controversial (I've always thought that naïveté is my best
> feature ;) ).  I'm unhappy that creating yet another place to discuss Lucene
> could fracture the community.  So I'm not certain that setting up a logged
> #lucene_dev is the right way to go.
>
> Please let me know what you think.
>
> Thanks,
> Steve
>


Proposed: New logged IRC channel: #lucene_dev

2010-03-17 Thread Steven A Rowe
As I mentioned in another thread on this list, I'm interested in setting up a 
permanent, linkable-to archive (a.k.a. log) for the lucene IRC channel.

On #lucene, some devs don't want to be logged, and so will not participate on a 
logged IRC channel.  Other devs want logging, to be able to point to it for 
background discussion, and also to provide transparency for those who don't 
participate regularly.

In an attempt to satisfy both camps, it was suggested that a new logged channel 
be set up.  The name #lucene_dev was suggested, in part because this name 
indicates separating user questions, which don't really need to be logged, from 
development, in the same way that the Lucene mailing lists are now separated.

Note that this is not an official proposal/vote or anything like that.  I'm 
just looking for comments that would change or stop this proposal from being 
implemented.

I should say that when I asked about logging #lucene, I didn't anticipate that 
it would be controversial (I've always thought that naïveté is my best feature 
;) ).  I'm unhappy that creating yet another place to discuss Lucene could 
fracture the community.  So I'm not certain that setting up a logged 
#lucene_dev is the right way to go.

Please let me know what you think.

Thanks,
Steve





[jira] Updated: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-03-17 Thread Ritesh Nigam (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ritesh Nigam updated LUCENE-2280:
-

Attachment: lucene.zip

Lucene infostream log file.

> IndexWriter.optimize() throws NullPointerException
> --
>
> Key: LUCENE-2280
> URL: https://issues.apache.org/jira/browse/LUCENE-2280
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3.2
> Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
>Reporter: Ritesh Nigam
> Attachments: lucene.jar, lucene.zip
>
>
> I am using the Lucene 2.3.2 search APIs in my application. I am indexing a 
> 45GB database, which creates an index file of approximately 200MB. After 
> indexing finishes, while optimize() is running, a NullPointerException is 
> thrown in my log and the index file gets corrupted. The log says:
> 
> Caused by: java.lang.NullPointerException
>   at org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
>   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
>   at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
>   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
>   at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
>   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
>   at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
> 
> This happens quite frequently, although I am not able to reproduce it on 
> demand. I saw a report that is somewhat related to my issue 
> (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e),
> but the only difference here is that I am not using Store.Compress for my 
> fields; I am using Store.NO instead. Please note that I am using the IBM JRE 
> for my application.
> Is this an issue with Lucene? If yes, in which version is it fixed?




[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-03-17 Thread Ritesh Nigam (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846408#action_12846408
 ] 

Ritesh Nigam commented on LUCENE-2280:
--

Yesterday the search indexer crashed again in my application and the index file 
got deleted. This time I had turned on the infoStream for IndexWriter; I am 
attaching the infoStream log file.




[jira] Resolved: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-17 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-2326.
---

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [New])

Committed revision: 924207

> Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
> branch and linking snowball tests by svn:externals
> ---
>
> Key: LUCENE-2326
> URL: https://issues.apache.org/jira/browse/LUCENE-2326
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch, 3.1
>
> Attachments: LUCENE-2326.patch, LUCENE-2326.patch
>
>
> As we often need to update backwards tests together with trunk and always 
> have to update the branch first, record rev no, and update build xml, I would 
> simply like to do a svn copy/move of the backwards branch.
> After a release, this is simply also done:
> {code}
> svn rm backwards
> svn cp releasebranch backwards
> {code}
> By this we can simply commit in one pass, create patches in one pass.
> The snowball tests are currently downloaded by svn.exe, too. These need a 
> fixed version for checkout. I would like to change this to use svn:externals. 
> Will provide patch, soon.




Re: lucene and solr trunk

2010-03-17 Thread Stefan Trcek
On Tuesday 16 March 2010 14:12:20 Mark Miller wrote:
> On 03/16/2010 09:05 AM, Andrzej Bialecki wrote:
> >
> > You could have used git instead. There is a good integration
> > between git and svn, and it's much easier (a giant
> > understatement...) to handle branching and merging in git, both
> > between git branches and syncing with external svn.
>
> Yeah, we have actually discussed doing things like GIT in the past -
> prob main reason we didn't is learning curve at the moment. I haven't
> used it yet.

I jumped off perforce by using a git-perforce bridge for daily work.
This made me comfortable with git while not changing anything public. 
And I had the certainty that if anything goes wrong, I can do it in 
perforce. Meanwhile we migrated a 2GB trunk sources repo from a legacy 
repo to git and it works fine. So don't hesitate to use a git-svn 
bridge.

git will open new modes of operation, e.g. instead of up- and 
downloading patches in jira, just hold a branch for each patch in a repo 
that is as public as jira, batch-upgrade those branches/patches to 
trunk, and pull those branches into the core developers' repo as desired.

Stefan




[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846362#action_12846362
 ] 

Uwe Schindler commented on LUCENE-2326:
---

Will commit soon to trunk and merge to flex.




[jira] Updated: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-17 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2326:
--

Attachment: LUCENE-2326.patch

New patch, which has some optimizations. It now also allows to run "ant test" 
from a source distribution ZIP/TGZ, which does not contain the backwards 
folder. The tests will not fail, instead print a warning message.




Re: lucene and solr trunk

2010-03-17 Thread Earwin Burrfoot
Some of these people got traumatized by maven; now they can only think
in terms of "mash everything together and sprinkle with
hand-downloaded dependency jars".
No offence : )

I, personally, prefer side-by-side layouts. You can add new stuff, and
wire dependencies to the old one, without reorganizing the tree. You
can checkout everything, or just the subset you need.
There is also another way - separate trunks for the modules, so they
can be release-managed separately, but there is a toplevel directory
that svn:external's all these trunks and allows checking
out/building/testing everything at once.

On Wed, Mar 17, 2010 at 11:51, Wouter Heijke  wrote:
> I'm just a surprised observer that doesn't seems to get all the trouble
> and need for this svn merge.
>
> I have my own private Solr-like framework around Lucene. It uses maven to
> build and nicely gets all dependencies to Lucene and Tika whenever I build
> or release, no problem there and certainly no need to have it merged into
> Lucene's svn!
>
> Professionally i work on a (world-class) geocoder that also nicely depends
> on Lucene by using maven, no problems there at all and no need to merge
> that code in Lucene's svn!
>
> Wouter
>
>> But it's actually the reverse?  Solr depends on Lucene but not vice/versa.
>>
>> (If instead I proposed making Solr a subdir of Lucene then I'd agree)
>>
>> So... if you checkout only lucene, you can cd there and do all you do
>> today with Lucene ("ant test", "ant dist", "svn diff", etc.).
>>
>> If you checkout solr, you can cd there and "ant test" will run all of
>> Lucene's and all of Solr's tests.  "svn diff" will include any changes
>> to lucene and to solr.
>>
>> Ie this achieves want we want -- Solr to depend on Lucene but not vice
>> versa, right?
>>
>> Mike
>>
>> On Tue, Mar 16, 2010 at 5:18 PM, Shai Erera  wrote:
>>> I have to agree w/ Jake that putting Lucene under Solr gives the
>>> impression
>>> as if suddenly Lucene became dependent on it ... and for really no good
>>> reasons. Are we making that decision to simplify the build of Solr? What
>>> are
>>> the problems Solr faces today w.r.t. its build and using a Lucene
>>> release or
>>> trunk revision?
>>>
>>> I didn't follow the Lucene/Solr merge on general@, because I didn't even
>>> know such a beast exists. So I guess I'm missing something ...
>>>
>>> Shai
>>>
>>> On Wed, Mar 17, 2010 at 12:01 AM, Jake Mannix 
>>> wrote:

>>>> On Tue, Mar 16, 2010 at 2:53 PM, Yonik Seeley wrote:
>>>>>
>>>>> > Chiming in just a bit here - isn't there any concern that independent of
>>>>> > whether or not people "can"
>>>>> > build lucene without checking out solr, the mere fact that Lucene will be
>>>>> > effectively a "subdirectory"
>>>>> > of solr...  is there no concern that there will then be a perception
>>>>> > that Lucene is a subproject of
>>>>> > Solr, instead of vice-versa?
>>>>>
>>>>> Who would have this perception?
>>>>> Casual users will be using downloads.
>>>>
>>>> Developers and dev managers at companies doing build vs. buy decisions
>>>> regarding
>>>> whether they will do one of the following:
>>>> 1) pay big bucks to get FAST or whatever
>>>> 2) use Solr (free/cheap!)
>>>> 3) pay [variable] bucks to build their own with Lucene
>>>> 4) pay [variable but high] to build their own from scratch
>>>> I'm not concerned with casual downloaders.  I'm talking about the
>>>> companies and people who
>>>> may or may not be interested in making multi-million dollar decisions
>>>> regarding using or
>>>> not using Lucene or Solr.
>>>>   -jake
>>>



-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: lucene and solr trunk

2010-03-17 Thread Wouter Heijke
I'm just a surprised observer that doesn't seems to get all the trouble
and need for this svn merge.

I have my own private Solr-like framework around Lucene. It uses maven to
build and nicely gets all dependencies to Lucene and Tika whenever I build
or release, no problem there and certainly no need to have it merged into
Lucene's svn!

Professionally i work on a (world-class) geocoder that also nicely depends
on Lucene by using maven, no problems there at all and no need to merge
that code in Lucene's svn!

Wouter

> But it's actually the reverse?  Solr depends on Lucene but not vice/versa.
>
> (If instead I proposed making Solr a subdir of Lucene then I'd agree)
>
> So... if you checkout only lucene, you can cd there and do all you do
> today with Lucene ("ant test", "ant dist", "svn diff", etc.).
>
> If you checkout solr, you can cd there and "ant test" will run all of
> Lucene's and all of Solr's tests.  "svn diff" will include any changes
> to lucene and to solr.
>
> Ie this achieves want we want -- Solr to depend on Lucene but not vice
> versa, right?
>
> Mike
>
> On Tue, Mar 16, 2010 at 5:18 PM, Shai Erera  wrote:
>> I have to agree w/ Jake that putting Lucene under Solr gives the
>> impression
>> as if suddenly Lucene became dependent on it ... and for really no good
>> reasons. Are we making that decision to simplify the build of Solr? What
>> are
>> the problems Solr faces today w.r.t. its build and using a Lucene
>> release or
>> trunk revision?
>>
>> I didn't follow the Lucene/Solr merge on general@, because I didn't even
>> know such a beast exists. So I guess I'm missing something ...
>>
>> Shai
>>
>> On Wed, Mar 17, 2010 at 12:01 AM, Jake Mannix 
>> wrote:
>>>
>>> On Tue, Mar 16, 2010 at 2:53 PM, Yonik Seeley  wrote:

>>>> > Chiming in just a bit here - isn't there any concern that independent of
>>>> > whether or not people "can"
>>>> > build lucene without checking out solr, the mere fact that Lucene will be
>>>> > effectively a "subdirectory"
>>>> > of solr...  is there no concern that there will then be a perception
>>>> > that Lucene is a subproject of
>>>> > Solr, instead of vice-versa?
>>>>
>>>> Who would have this perception?
>>>> Casual users will be using downloads.
>>>
>>> Developers and dev managers at companies doing build vs. buy decisions
>>> regarding
>>> whether they will do one of the following:
>>> 1) pay big bucks to get FAST or whatever
>>> 2) use Solr (free/cheap!)
>>> 3) pay [variable] bucks to build their own with Lucene
>>> 4) pay [variable but high] to build their own from scratch
>>> I'm not concerned with casual downloaders.  I'm talking about the
>>> companies and people who
>>> may or may not be interested in making multi-million dollar decisions
>>> regarding using or
>>> not using Lucene or Solr.
>>>   -jake
>>


