Re: TestCodecs running time

2010-04-15 Thread Shai Erera
I see you already did that, Mike :). Thanks! Now the tests run for 2s.

Shai

On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote:

 It's also slow because it repeats all the tests for each of the core
 codecs (standard, sep, pulsing, intblock).

 I think it's fine to reduce the number of iterations -- just make sure
 there's no seed to newRandom() so the distributed testing is
 effective.
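
 The pattern described here (fewer iterations, but an unseeded Random so
 distributed runs still cover different seeds) can be sketched roughly as
 below; newRandom() is a stand-in for the LuceneTestCase helper, and the
 iteration body is invented for illustration:

```java
import java.util.Random;

public class TestCodecsSketch {

    // Stand-in for LuceneTestCase.newRandom() -- an assumption, not the real
    // API. The key point from the thread: no fixed seed is passed, so each
    // (distributed) test run explores a different random path.
    static Random newRandom() {
        return new Random(); // unseeded
    }

    // Reduced iteration count -- per the thread, 4000 iterations per thread
    // made the test run 35-40s.
    static final int NUM_TEST_ITER = 20;

    // Runs a randomized body NUM_TEST_ITER times; returns iterations executed.
    static int runRandomizedTest(Random random) {
        int iters = 0;
        for (int i = 0; i < NUM_TEST_ITER; i++) {
            int docCount = 1 + random.nextInt(100); // randomized per-iteration input
            if (docCount < 1 || docCount > 100) {
                throw new AssertionError("docCount out of range: " + docCount);
            }
            iters++;
        }
        return iters;
    }

    public static void main(String[] args) {
        System.out.println("ran " + runRandomizedTest(newRandom()) + " iterations");
    }
}
```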

 Mike

 On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  I've noticed that TestCodecs takes an insanely long time to run on my
  machine - 35-40 seconds. Is that expected?
  The reason it runs so long seems to be that its threads each make
  4000 iterations ... is that really required to ensure correctness?
 
  Shai
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




SnapshotDeletionPolicy throws NPE if no commit happened

2010-04-15 Thread Shai Erera
SDP throws an NPE if the index includes no commits but snapshot() is called.
This is an extreme case, but it can happen if one takes snapshots (for backup
purposes, for example) in a separate code path from indexing, and does not
know whether commit was called or not.

I think we should throw an IllegalStateException, w/ a descriptive message,
instead of failing with an NPE. Alternatively, we can just return null and
document it ... But I prefer the ISE. What do you think?

Shai
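
A minimal sketch of the proposed guard (not the actual SnapshotDeletionPolicy
source; the field and method shapes here are simplified assumptions):

```java
import java.util.List;

public class SnapshotSketch {
    // null until the deletion policy has seen at least one commit
    private List<String> commits;

    // called with the current commit points (simplified stand-in for onCommit)
    void onCommit(List<String> newCommits) {
        this.commits = newCommits;
    }

    // proposed behavior: fail fast with a descriptive ISE instead of an NPE
    String snapshot() {
        if (commits == null || commits.isEmpty()) {
            throw new IllegalStateException(
                "No index commit to snapshot; call IndexWriter.commit() first");
        }
        return commits.get(commits.size() - 1); // latest commit point
    }
}
```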


Re: SnapshotDeletionPolicy throws NPE if no commit happened

2010-04-15 Thread Shai Erera
Well ... one can still call commit() or close() right after IW creation. And
this is a very rare case to be hit by. I was just asking whether we want
to add explicit and clear protective code for it or not.

Shai

On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote:

 We should just let IW create a null commit on an empty directory, like
 it always did ;)
 Then a whole class of such problems disappears.

 On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote:
  SDP throws NPE if the index includes no commits, but snapshot() is called.
  This is an extreme case, but can happen if one takes snapshots (for backup
  purposes for example) in a separate code segment than indexing, and does not
  know if commit was called or not.
 
  I think we should throw an IllegalStateException instead of falling on
  NPE, w/ a descriptive message. Alternatively, we can just return null and
  document it ... But I prefer the ISE instead. What do you think?
 
  Shai
 



 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785





Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
Well ... I think that version numbers mean more than we'd like them to mean,
as people perceive them. Let's discuss the format X.Y.Z:

When X is changed, it should mean something 'big' happened - index structure
has changed (e.g. the flexible scoring work), new Java version supported
(Java 1.6), and even stuff like 'flex', which includes statements like "if you
don't want your app to slow down, consider reindexing". Such things signal a
major change in Lucene, sometimes even just policy changes (Java version
supported) and therefore I think we should reserve the ability to bump X
when such things happen.

Another thing is the index structure back-compat policy - today Lucene
supports the X-1 index structure, and during upgrades of X.Y versions your
segments are gradually migrated. Eventually, when you upgrade to 4.0 you
should know whether you have a 2.x index, and call optimize just in case
you're not sure it has been fully migrated (if you've upgraded to 3.x).
If we start bumping 'X' too often, we'll either need to change the X-1
policy to X-N, which will just complicate matters for users, or we'll keep
the X-1 policy, but people will need to call optimize more frequently.

Y should change on a regular basis, and no back-compat, API-wise or index
runtime-wise, is guaranteed. So the Collector and per-segment searches in 2.9
could have gone in w/o deprecating tons of API, as could the TokenStream work.
Changes to Analyzer's runtime capabilities will also be allowed between Y
revisions.

Z should change when bugs are fixed, or when features are backported.
Really ... we rarely fix bugs on a released Y branch, and I don't expect
tons of features will be backported to a Y branch (to create a Z+1 release).
Therefore this should not confuse anyone.

So all I'm saying is that instead of increasing X whenever the API, index
structure or runtime behavior has changed, I'm simply proposing to
differentiate between really major changes to those that just say 'we're
not back-compat compliant'.

But above all, I'd like to see this change happening, so if I need to
surrender to the X vs. X+Y approach, I will. Just think it will create some
confusion.

BTW, w/ all that - does it mean 'backwards' can be dropped, or at least
test-backwards activated only on a branch which we decide needs it? That'll
be really great.

Shai

On Thu, Apr 15, 2010 at 10:24 AM, Earwin Burrfoot ear...@gmail.com wrote:

 We can remove Version, because all incompatible changes go straight to
 a new major release, which we release more often, yes.
 3.x is going to be released after 4.0 if bugs are found and fixed, or
 if people ask to backport some (minor?) features, and some dev has
 time for this.

 The question of what to call a major release in the X.Y.Z scheme - X or Y -
 is there, but immaterial :) I think it's okay to settle on X.Y; we
 have major releases and bugfixes, so what would that third number be used
 for?

  On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote:
   So then I don't understand this:
  
   {quote}
   * A major release always bumps the major release number (2.x -> 3.0),
   and starts a new branch for all minor (3.1, 3.2, 3.3) releases along
   that branch
  
   * There is no back compat across major releases (index nor APIs),
   but full back compat within branches.
   {quote}
  
   What's different than what's done today? How can we remove Version in that
   world, if we need to maintain full back-compat between 3.1 and 3.2, index
   and API-wise? We'll still need to deprecate and come up w/ new classes
   every time, and we'll still need to maintain runtime changes back-compat.
  
   Unless you're telling me we'll start releasing major releases more often?
   Well ... then we're saying the same thing, only I think that instead of
   releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ...
   because if you look back, every minor release included API deprecations
   as well as back-compat breaks. That means that every minor release should
   have been a major release, right?
  
   Point is, if I understand correctly and you agree w/ my statement above -
   I don't see why anyone would release a 3.x after 4.0 is out unless someone
   really wants to work hard on maintaining back-compat of some features.
  
   If it's just a numbering thing, then I don't think it matters what is
   defined as 'major' vs. 'minor'. One way is to define 'major' as X and
   minor as X.Y, and another is to define major as 'X.Y' and minor as
   'X.Y.Z'. I prefer the latter but don't have any strong feelings against
   the former. Just pointing out that X will grow more rapidly than today.
   That's all.
  
   So did I get it right?
  
   Shai
 
   On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote:
 
  I don't read what you wrote and what Mike wrote as even close to the
  same.
 
  - Mark
  http://www.lucidimagination.com (mobile)
  On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote:
 
  Ahh ... a dream finally comes true ... what

Re: SnapshotDeletionPolicy throws NPE if no commit happened

2010-04-15 Thread Shai Erera
BTW, even if it's a stupid thing to do, someone can today create SDP and
call snapshot without ever creating IW. And it's not an impossible scenario.
Consider a backup-aware application which creates SDP first, then passes it
to the indexing process and the backup process, separately. The backup
process doesn't need to know of IW at all, and might call snapshot() before
IW was even created, and SDP.onInit was called. It's a possibility, not
saying it's a great and safe architecture.

So this is really about whether we want to write clear protective code, or
allow the NPE.

Shai

2010/4/15 Shai Erera ser...@gmail.com

 Well ... one can still call commit() or close() right after IW creation.
 And this is a very rare case to be hit by. Was just asking about whether we
 want to add an explicit and clear protective code about it or not.

 Shai


 On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote:

 We should just let IW create a null commit on an empty directory, like
 it always did ;)
 Then a whole class of such problems disappears.

 On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote:
  SDP throws NPE if the index includes no commits, but snapshot() is called.
  This is an extreme case, but can happen if one takes snapshots (for backup
  purposes for example) in a separate code segment than indexing, and does
  not know if commit was called or not.
 
  I think we should throw an IllegalStateException instead of falling on
  NPE, w/ a descriptive message. Alternatively, we can just return null and
  document it ... But I prefer the ISE instead. What do you think?
 
  Shai
 



 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785






Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
Well ... I must say that I completely disagree w/ dropping index structure
back-support. Our customers will simply not hear of reindexing 10s of TBs of
content because of version upgrades. Such a decision is key to Lucene
adoption in large-scale projects. It's entirely not about whether Lucene is
a content store or not - content is stored on other systems, I agree. But
that doesn't mean reindexing it is tolerable.

Up until now, Lucene migrated my segments gradually, and before I upgraded
from X+1 to X+2 I could run optimize() to ensure my index will be readable
by X+2. I don't think I can myself agree to it, let alone convince all the
stakeholders in my company who adopt Lucene today in numerous projects, to
let go of such capability. We've been there before (requiring reindexing on
version upgrades) w/ some offerings and customers simply didn't like it and
were forced to use an enterprise-class search engine which offered less (and
didn't use Lucene, up until recently!). Until we moved to Lucene ...

What's Solr's take on it?

I differentiate between structural changes and runtime changes. I, myself,
don't mind if we let go of back-compat support for runtime changes, such as
those generated by analyzers, for a couple of reasons. The most important
ones are (1) these are not so frequent (though neither are index structural
changes) and (2) that's a decision I, as the application developer, make -
using or not a newer version of an Analyzer. I don't mind working hard to
make a 2.x
Analyzer version work in the 3.x world, but I cannot make a 2.x index
readable by a 3.x Lucene jar, if the latter doesn't support it. That's the
key difference, in my mind, between the two. I can choose not to upgrade at
all to a newer analyzer version ... but I don't want to be forced to stay w/
older Lucene versions and features because of that ... well people might say
that it's not Lucene's problem, but I beg to differ. Lucene benefits from
wider and faster adoption and we rely on new features to be adopted quickly.
That might be jeopardized if we let go of that strong capability, IMO.

What we can do is provide an index migration tool ... but personally I don't
see the difference between that and gradually migrating segments as
they are merged, code-wise. I mean - it has to be the same code. Only an
index migration tool may take days to complete on a very large index, while
the ongoing migration takes ~0 time when you come to upgrade to a newer
Lucene release.

And the note about Terrier requiring reindexing ... well, I can't call that a
strength of it; it's a damn big weakness IMO.

About the release pace, I don't think we can suddenly release every 2 years
... it makes people think the project is stuck. And some out there are not so
fond of using a 'trunk' version and release it w/ their products because
trunk is perceived as ongoing development (which it is) and thus less
stable, or is likely to change and most importantly harder to maintain (as
the consumer). So I still think we should release more often than not.

That's why I wanted to differentiate X and Y, but I don't mind if we release
just X ... if that's so important to people. BTW Mike, Eclipse's releases
are like Lucene, and in fact I don't know of so many projects that just
release X ... many of them seem to release X.Y.

I don't understand why we're treating this as an all-or-nothing thing. We
can let go of API back-compat; that clearly has no effect on index structure
and content. We can even let go of index runtime changes for all I care. But
I simply don't think we can let go of index structure back-support.

Shai

On Thu, Apr 15, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote:

 2010/4/15 Shai Erera ser...@gmail.com:

  One way is to define 'major' as X and minor X.Y, and another is to define
 major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any
 strong feelings against the former.

 I prefer X.Y, ie, changes to Y only is a minor release (mostly bug
 fixes but maybe small features); changes to X is a major release.  I
 think that's more standard, ie, people will generally grok that 3.3 -> 4.0
 is a major change but 3.3 -> 3.4 isn't.

 So this proposal would change how Lucene releases are numbered.  Ie,
 the next release would be 4.0.  Bug fixes / small features would then
 be 4.1.

  Index back compat should be maintained between major releases, like it is
 today, STRUCTURE-wise.

 No... in the proposal, you must re-index on upgrading to the next
 major release (3.x -> 4.0).

 I think supporting old indexes, badly (what we do today) is not a
 great solution.  EG on upgrading to 3.1 you'll immediately see a
 search perf hit since the flex emulation layer is running.  It's a
 trap.

 It's this freedom, I think, that'd let us drop Version entirely.  It's
 the back-compat of the index that is the major driver for having
 Version today (eg so that the analyzers can produce tokens matching
 your old index).

 EG Terrier seems

Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
Thanks Danil - you reminded me of another reason why reindexing is
impossible - fetching the data, even if it's available, is too damn costly.

Robert, I think you're driven by Analyzers changes ... been too much around
them I'm afraid :).

A major version upgrade is a move to Java 1.5, for example. I can do that,
and I don't see why I need to reindex my data because of that. And I simply
don't buy the "do this work on your own" argument ... people can take a
snapshot of the code, maintain it separately and you'll never hear back from
them. Who benefits - neither!
It's open source - true, but it's way past the "Hey look, I'm a new open
source project w/ a dozen users, I can do whatever I want" stage. Lucene is a
respected open source project, w/ serious adoption and deployments. People
trust the select few committers here to do it right for them, so they
don't need to invest the time and resources in developing core IR stuff. And
now you're pushing a "do it yourself" approach? I simply don't get or buy
it.

When were you struck w/ maintaining a backwards change because the index
structure changed? I bet not so many of us, or shall I say just the few Mikes
out there? So how hard is it to require such back-compat support? I
wholeheartedly agree that we shouldn't keep back-compat on Analyzer changes,
nor on bugs such as the one which changed the position of the field from -1 to
0 (a while ago - don't remember the exact details).

Shai

On Thu, Apr 15, 2010 at 3:17 PM, Danil ŢORIN torin...@gmail.com wrote:

 Sometimes it's REALLY impossible to reindex, or it has an absolutely
 prohibitive cost in a running production system (I can't shut it down for
 maintenance, so I need a lot of hardware to reindex ~5 billion documents; I
 have no idea what the costs are to retrieve that data all over again, but I
 estimate it to be quite a lot.)

 And providing a way to migrate existing indexes to new lucene is crucial
 from my point of view.

 I don't care what this way is: calling optimize() with newer lucene or
 running some tool that takes 5 days, it's ok with me.

 Just don't put me through full reindexing as I really don't have all that
 data anymore.
 It's not my data, i just receive it from clients, and provide a search
 interface.

 It took years to build those indexes, rebuilding is not an option, and
 staying with old lucene forever just sucks.

 Danil.

 On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote:



 On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote:

 Well ... I must say that I completely disagree w/ dropping index
 structure back-support. Our customers will simply not hear of reindexing 10s
 of TBs of content because of version upgrades. Such a decision is key to
 Lucene adoption in large-scale projects. It's entirely not about whether
 Lucene is a content store or not - content is stored on other systems, I
 agree. But that doesn't mean reindexing it is tolerable.


 I don't understand how its helpful to do a MAJOR version upgrade without
 reindexing... what in the world do you stand to gain from that?

 The idea here, is that development can be free of such hassles.
 Development should be this way.

 If you, Shai, need some feature X.Y.Z from Version 4 and don't want to
 reindex, and are willing to do the work to port it back to Version 3 in a
 completely backwards compatible way, then under this new scheme it can
 happen.


 --
 Robert Muir
 rcm...@gmail.com





Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
I can live w/ that Earwin ... I prefer the ongoing upgrades still, but I
won't hold off the back-compat policy change vote because of that.

Shai

On Thu, Apr 15, 2010 at 3:30 PM, Earwin Burrfoot ear...@gmail.com wrote:

 I think an index upgrade tool is okay?
 While you still definitely have to code it, things like "if idxVer==m
 doOneStuff elseif idxVer==n doOtherStuff else blowUp" are kept away
 from lucene innards and we all profit?
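
 The kind of version dispatch described here, kept in a standalone upgrade
 tool, might look roughly like this; the class name, version constants and
 convert methods are all hypothetical, not Lucene API:

```java
public class UpgradeToolSketch {
    // hypothetical on-disk format version numbers
    static final int VERSION_M = 2;
    static final int VERSION_N = 3;
    static final int CURRENT   = 4;

    // dispatch on the detected index version; returns the resulting version
    static int upgrade(int idxVer) {
        if (idxVer == VERSION_M) {
            return convertFromM();   // "doOneStuff"
        } else if (idxVer == VERSION_N) {
            return convertFromN();   // "doOtherStuff"
        } else if (idxVer == CURRENT) {
            return CURRENT;          // already current, nothing to do
        }
        // "blowUp": unknown/unsupported version
        throw new IllegalArgumentException("Unsupported index version: " + idxVer);
    }

    static int convertFromM() { return CURRENT; } // placeholder conversion
    static int convertFromN() { return CURRENT; } // placeholder conversion
}
```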

 On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote:
  its open source, if you feel this way, you can put the work to add
 features
  to some version branch from trunk in a backwards compatible way.
  Then this branch can have a backwards-compatible minor release with new
  features, but nothing ground-breaking.
  but this kinda stuff shouldnt hinder development on trunk.
 
  On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote:
 
  Sometimes it's REALLY impossible to reindex, or has absolutely
 prohibitive
  cost to do in a running production system (i can't shut it down for
  maintainance, so i need a lot of hardware to reindex ~5 billion
 documents, i
  have no idea what are the costs to retrieve that data all over again,
 but i
  estimate it to be quite a lot)
  And providing a way to migrate existing indexes to new lucene is crucial
  from my point of view.
  I don't care what this way is: calling optimize() with newer lucene or
  running some tool that takes 5 days, it's ok with me.
  Just don't put me through full reindexing as I really don't have all
 that
  data anymore.
  It's not my data, i just receive it from clients, and provide a search
  interface.
  It took years to build those indexes, rebuilding is not an option, and
  staying with old lucene forever just sucks.
 
  Danil.
  On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote:
 
 
  On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote:
 
  Well ... I must say that I completely disagree w/ dropping index
  structure back-support. Our customers will simply not hear of
 reindexing 10s
  of TBs of content because of version upgrades. Such a decision is key
 to
  Lucene adoption in large-scale projects. It's entirely not about
 whether
  Lucene is a content store or not - content is stored on other systems,
 I
  agree. But that doesn't mean reindexing it is tolerable.
 
 
  I don't understand how its helpful to do a MAJOR version upgrade
 without
  reindexing... what in the world do you stand to gain from that?
  The idea here, is that development can be free of such hassles.
  Development should be this way.
  If you, Shai, need some feature X.Y.Z from Version 4 and don't want to
  reindex, and are willing to do the work to port it back to Version 3 in
 a
  completely backwards compatible way, then under this new scheme it can
  happen.
 
  --
  Robert Muir
  rcm...@gmail.com
 
 
 
 
  --
  Robert Muir
  rcm...@gmail.com
 



 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785





Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
Well ... I could argue that it's you who miss the point :).

I completely don't buy the "all the new features" comment -- how many new
features are in a major release which force you to consider reindexing? Yet
there are many of them that change the API. How will I know whether a
release supports my index or not? Why do I need to work hard to back-port
all the newly developed issues onto a branch I use? How many of those branches
will exist? Will they all run nightly unit tests? Can I cut a release of
such a branch myself? Or will I need the PMC or a VOTE? This will get
complicated pretty fast ...

Lucene is not a "do it yourself" kit - we try so hard to have the best
defaults, best out-of-the-box experience ... best everything for our users.
Even w/ Analyzers we try so damn hard, while we could have simply
componentized everything and told the users "you can use those filters,
tokenizers, segment mergers, policies etc. to make up your indexing
application" ...

And I don't think there are features out there that exist and are not
contributed because people are afraid of the index format changes ...
obviously if they have done it, they're past the fear of handling the index
format ... I'd like to hear of one such feature. I'd bet there are such out
there that are not contributed for IP, business and laziness reasons.

Shai

On Thu, Apr 15, 2010 at 3:56 PM, Robert Muir rcm...@gmail.com wrote:

 I think you guys miss the entire point.
 
 The idea that you can keep getting all the new features without
 reindexing is merely an illusion.
 
 Instead, features simply aren't being added at all, because the policy
 makes it too cumbersome.
 
 Why is it problematic to have a different SVN branch/release, with lots of
 new features, that requires you to reindex and change your app?
 
 If it's too difficult to reindex, it doesn't break your app that features
 exist elsewhere that you cannot access.
 It's the same as it is today: there are features you cannot access, except
 they do not even exist in apache SVN at all, even trunk, because of these
 problems.

 On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote:

 I like the idea of an index conversion tool over silent online upgrade
 because it is
 1. controllable - with online upgrade you never know for sure when
 your index is completely upgraded; even optimize() won't help here, as
 it is a noop for already-optimized indexes
 2. way easier to write - as flex shows, index format changes are
 accompanied by API changes. Here you don't have to emulate new APIs
 over old structures (can be impossible for some cases?), you only have
 to, well, convert.

 On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote:
  All I ask is a way to migrate existing indexes to newer format.
 
 
  On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote:
 
  its open source, if you feel this way, you can put the work to add
  features to some version branch from trunk in a backwards compatible
 way.
  Then this branch can have a backwards-compatible minor release with new
  features, but nothing ground-breaking.
  but this kinda stuff shouldnt hinder development on trunk.
 
  On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com
 wrote:
 
  Sometimes it's REALLY impossible to reindex, or has absolutely
  prohibitive cost to do in a running production system (i can't shut it
 down
  for maintainance, so i need a lot of hardware to reindex ~5 billion
  documents, i have no idea what are the costs to retrieve that data all
 over
  again, but i estimate it to be quite a lot)
  And providing a way to migrate existing indexes to new lucene is
 crucial
  from my point of view.
  I don't care what this way is: calling optimize() with newer lucene or
  running some tool that takes 5 days, it's ok with me.
  Just don't put me through full reindexing as I really don't have all
 that
  data anymore.
  It's not my data, i just receive it from clients, and provide a search
  interface.
  It took years to build those indexes, rebuilding is not an option, and
  staying with old lucene forever just sucks.
 
  Danil.
  On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote:
 
 
  On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com
 wrote:
 
  Well ... I must say that I completely disagree w/ dropping index
  structure back-support. Our customers will simply not hear of
 reindexing 10s
  of TBs of content because of version upgrades. Such a decision is
 key to
  Lucene adoption in large-scale projects. It's entirely not about
 whether
  Lucene is a content store or not - content is stored on other
 systems, I
  agree. But that doesn't mean reindexing it is tolerable.
 
 
  I don't understand how its helpful to do a MAJOR version upgrade
 without
  reindexing... what in the world do you stand to gain from that?
  The idea here, is that development can be free of such hassles.
  Development should be this way.
  If you, Shai, need some feature X.Y.Z from

[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.

2010-04-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857388#action_12857388
 ] 

Shai Erera commented on LUCENE-2396:


Robert, I think this is great! Can we move more analyzers from core here? I 
think, however, that a backwards section in changes is important, because it 
alerts users about those analyzers whose runtime behavior changed. Otherwise 
how would the poor users know that? It doesn't mean you need to maintain back 
compat support, but at least alert them when things change.

Even if we eventually decide to remove API bw completely, a section in CHANGES 
will still be required to help users upgrade easily.

 remove version from contrib/analyzers.
 --

 Key: LUCENE-2396
 URL: https://issues.apache.org/jira/browse/LUCENE-2396
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2396.patch


 Contrib/analyzers has no backwards-compatibility policy, so let's remove 
 Version so the API is consumable.
 if you think we shouldn't do this, then instead explicitly state and vote on 
 what the backwards compatibility policy for contrib/analyzers should be 
 instead, or move it all to core.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.

2010-04-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857396#action_12857396
 ] 

Shai Erera commented on LUCENE-2396:


Static? Weren't you against that!? 

But if we remove back compat from analyzers, why do we need Version? Or is it 
just the API bw compat that we remove?

 remove version from contrib/analyzers.
 --

 Key: LUCENE-2396
 URL: https://issues.apache.org/jira/browse/LUCENE-2396
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2396.patch


 Contrib/analyzers has no backwards-compatibility policy, so let's remove 
 Version so the API is consumable.
 if you think we shouldn't do this, then instead explicitly state and vote on 
 what the backwards compatibility policy for contrib/analyzers should be 
 instead, or move it all to core.







Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
I seriously don't understand the fuss around index format back compat.
How many times has this changed, such that it is too much to ask that
X support X-1?

I prefer ongoing segment merging, but can live w/ a manual
converter tool. Thing is - I'll probably not be able to develop one
myself outside the scope of Lucene, because I'll miss tons of API. So
having Lucene declare and adhere to it seems reasonable to me.

BTW Earwin, we can come up w/ a migrate() method on IW to accomplish
manual migration of the segments that are still on old versions.
That's not the point about whether optimize() is good or not. It is
the difference between telling the customer to run a 5-day migration
process, or a couple of hours. At the end of the day, the same
migration code will need to be written, whether for the manual or
automatic case, and probably by the same developer who changed the
index format. It's the difference of when it happens.

And I also think that a manual migration tool will need access to some
lower level API which is not exposed today, and will generally not
perform as well as online migration. But that's a side note...

Shai
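
A rough sketch of what such a migrate() could do conceptually, modeling
segments as plain format numbers (entirely hypothetical, not IndexWriter
API): only stale segments are rewritten, current ones are left alone.

```java
import java.util.ArrayList;
import java.util.List;

public class MigrateSketch {
    static final int CURRENT_FORMAT = 4; // hypothetical current format number

    // Each int stands in for one segment's on-disk format version. Only
    // stale segments are rewritten; current ones (and hence the overall
    // segment size distribution) are left untouched.
    static List<Integer> migrate(List<Integer> segmentFormats) {
        List<Integer> migrated = new ArrayList<>();
        for (int fmt : segmentFormats) {
            migrated.add(fmt < CURRENT_FORMAT ? CURRENT_FORMAT : fmt);
        }
        return migrated;
    }
}
```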

On Thursday, April 15, 2010, Earwin Burrfoot ear...@gmail.com wrote:
 I'd like to remind that Mike's proposal has stable branches.

 We can branch off preflex trunk right now and wrap it up as 3.1.
 Current trunk is declared as future 4.0 and all backcompat cruft is
 removed from it.
 If some new features/bugfixes appear in trunk, and they don't break
 stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3,
 etc

 Thus, devs are free to work without back-compat burden, bleeding edge
 users get their blood, conservative users get their stability + a
 subset of new features from stable branches.


 On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote:
 On 04/15/2010 01:50 PM, Earwin Burrfoot wrote:

 First, the index format. IMHO, it is a good thing for a major release to
 be
 able to read the prior major release's index. And the ability to convert
 it
 to the current format via optimize is also good. Whatever is decided on
 this
 thread should take this seriously.


 Optimize is a bad way to convert to current.
 1. conversion is not guaranteed, optimizing already optimized index is a
 noop
 2. it merges all your segments. if you use BalancedSegmentMergePolicy,
 that destroys your segment size distribution

 Dedicated upgrade tool (available both from command-line and
 programmatically) is a good way to convert to current.
 1. conversion happens exactly when you need it, conversion happens for
 sure, no additional checks needed
 2. it should leave all your segments as is, only changing their format



 It is my observation, though possibly not correct, that core only has
 rudimentary analysis capabilities, handling English very well. To handle
 other languages well contrib/analyzers is required. Until recently it
 did
 not get much love. There have been many bw compat breaking changes
 (though
 w/ version one can probably get the prior behavior). IMHO, most of
 contrib/analyzers should be core. My guess is that most non-trivial
 applications will use contrib/analyzers.


 I counter - most non-trivial applications will use their own analyzers.
 The more modules - the merrier. You can choose precisely what you need.


 By and large an analyzer is a simple wrapper for a tokenizer and some
 filters. Are you suggesting that most non-trivial apps write their own
 tokenizers and filters?

 I'd find that hard to believe. For example, I don't know enough Chinese,
 Farsi, Arabic, Polish, ... to come up with anything better than what Lucene
 has to tokenize, stem or filter these.



 Our user base are those with ancient,
 underpowered laptops in 3-rd world countries. On those machines it might
 take 10 minutes to create an index and during that time the machine is
 fairly unresponsive. There is no opportunity to do it in the
 background.


 Major Lucene releases (feature-wise, not version-wise) happen like
 once in a year, or year-and-a-half.
 Is it that hard for your users to wait ten minutes once a year?


  I said that was for one index. Multiply that times the number of books
 available (300+) and yes, it is too much to ask. Even if a small subset is
 indexed, say 30, that's around 5 hours of waiting.

 Under consideration is the frequency of breakage. Some are suggesting a
 greater frequency than yearly.

 DM

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
The reason Earwin why online migration is faster is because when u
finally need to *fully* migrate your index, most chances are that most
of the segments are already on the newer format. Offline migration
will just keep the application idle for some amount of time until ALL
segments are migrated.

During the lifecycle of the index, segments are merged anyway, so
migrating them on the fly virtually costs nothing. At the end, when u
upgrade to a Lucene version which doesn't support the previous index
format, you'll in the worst case need to migrate a few large segments
which were never merged. I don't know how many of those there will be
as it really depends on the application, but I'd bet this process will
touch just a few segments. And hence, throughput wise it will be a lot
faster.

We should create a migrate() API on IW which will touch just those
segments and not incur a full optimize. That API can also be used for
an offline migration tool, if we decide that's what we want.
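A minimal sketch of what such a migrate() could look like, assuming a hypothetical per-segment format version (none of these names are real Lucene APIs — Segment and migrate() are stand-ins for illustration): only segments still on an older format get rewritten, so an index whose segments were already converted during normal merging costs almost nothing to finish.

```java
import java.util.ArrayList;
import java.util.List;

class MigrateSketch {
    static final int CURRENT_FORMAT = 4;

    // Stand-in for a segment's on-disk metadata.
    static class Segment {
        final String name;
        int format;
        Segment(String name, int format) { this.name = name; this.format = format; }
    }

    // Rewrites only segments still on an older format and returns their
    // names; segments already on CURRENT_FORMAT are left untouched.
    static List<String> migrate(List<Segment> segments) {
        List<String> rewritten = new ArrayList<>();
        for (Segment s : segments) {
            if (s.format < CURRENT_FORMAT) {
                s.format = CURRENT_FORMAT; // stand-in for rewriting the segment files
                rewritten.add(s.name);
            }
        }
        return rewritten;
    }
}
```

The point of the sketch: unlike optimize(), nothing is merged, so the segment size distribution is preserved and up-to-date segments are skipped entirely.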

Shai

On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote:
 Not sure if plain users are allowed/encouraged to post in this list,
 but wanted to mention (just an opinion from a happy user), as other
 users have, that not all of us can reindex just like that. It would
 not be 10 min for one of our installations for sure...

 First, i would need to implement some code to reindex, cause my source
 data is postprocessed/compressed/encrypted/moved after it arrives to
 the application, so I would need to retrieve all etc. And then
 reindexing it would take days.
 javier

 On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote:
 BTW Earwin, we can come up w/ a migrate() method on IW to accomplish
 manual migration on the segments that are still on old versions.
 That's not the point about whether optimize() is good or not. It is
 the difference between telling the customer to run a 5-day migration
 process, or a couple of hours. At the end of the day, the same
 migration code will need to be written whether for the manual or
 automatic case. And probably by the same developer which changed the
 index format. It's the difference of when does it happen.

 Converting stuff is easier then emulating, that's exactly why I want a
 separate tool.
 There's no need to support cross-version merging, nor to emulate old APIs.

 I also don't understand why offline migration is going to take days
 instead of hours for online migration??
 WTF, it's gonna be even faster, as it doesn't have to merge things.

 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
+1 on the Analyzers as well.

Earwin, I think I don't mind if we introduce migrate() elsewhere rather than
on IW. What I meant to say is that if we stick w/ index format back-compat
and ongoing migration, then such a method would be useful on IW for
customers to call to ensure they're on the latest version.
But if the majority here agree w/ a standalone tool, then I'm ok if it sits
elsewhere.

Grant, I'm all for 'just doing it and see what happens'. But I think we need
to at least decide what we're going to do so it's clear to everyone. Because
I'd like to know if I'm about to propose an index format change, whether I
need to build migration tool or not. Actually, I'd like to know if people
like Robert (basically those who have no problem to reindex and don't
understand the fuss around it) will want to change the index format - can I
count on them to be asked to provide such tool? That's to me a policy we
should decide on ... whatever the consequences.

But +1 for changing something ! Analyzers at first, API second.

Shai

On Thu, Apr 15, 2010 at 10:52 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote:
  for now simply moving analyzers to its own jar filE would be a great
 step!

 +1 -- why not consolidate all analyzers now?  (And fix indexer to
 require a minimal API = TokenStream minus reset & close).

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
Grant ... you've made it - the 100th response to that thread. Do we keep
records somewhere? :)

Ok I'm simply proposing to define 'index back-compat' as index format
back-compat. With that, we don't 'wait' for something to happen, we just say
up front that if that changes, we provide a migration tool for the latest
index format version. Simple as that. The rest, we can 'see what happens'
...

Shai

On Thu, Apr 15, 2010 at 11:29 PM, Grant Ingersoll gsing...@apache.orgwrote:


 On Apr 15, 2010, at 4:21 PM, Shai Erera wrote:

  +1 on the Analyzers as well.
 
  Earwin, I think I don't mind if we introduce migrate() elsewhere rather
 than on IW. What I meant to say is that if we stick w/ index format
 back-compat and ongoing migration, then such a method would be useful on IW
 for customers to call to ensure they're on the latest version.
  But if the majority here agree w/ a standalone tool, then I'm ok if it
 sits elsewhere.
 
  Grant, I'm all for 'just doing it and see what happens'. But I think we
 need to at least decide what we're going to do so it's clear to everyone.
 Because I'd like to know if I'm about to propose an index format change,
 whether I need to build migration tool or not. Actually, I'd like to know if
 people like Robert (basically those who have no problem to reindex and don't
 understand the fuss around it) will want to change the index format - can I
 count on them to be asked to provide such tool? That's to me a policy we
 should decide on ... whatever the consequences.

 As I said, we should strive for index compatibility, but even in the past,
 we said we did, but the implications weren't always clear.   I think index
 compatibility is very important.  I've seen plenty of times where reindexing
 is not possible.  But even then, you still have the option of testing to
 find out whether you can update or not.  If you can't update, then don't
 until you can figure out how to do it.  FWIW, I think our approach is much
 more proactive than see what happens.  I'd argue, that in the past, our
 approach was see what happens, only the seeing didn't happen until after
 the release!

 -Grant
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
Robert ... I'm sorry but changes to Analyzers don't *force* people to
reindex. They can simply choose not to use the latest version. They can
choose not to upgrade a Unicode version. They can copy the entire Analyzer
code to match their needs. Index format changes is what I'm worried about
because that *forces* people to reindex.

Analyzers, believe it or not, are just a tool, an out of the box tool even,
we're giving users to analyze their stuff. Probably a tool used by most of
our users, but not all. Some have their own tools, that are currently
wrapped as a Lucene Analyzer just because the API mandates. But we were
talking about that too recently no? Ripping Analyzer off IndexWriter?

Just to be clear - I think your work on Analyzers is fantastic ! Really !
Seriously !
But it's a choice someone can make ... whereas index format is a given - you
have to live with it, or never upgrade Lucene.

But I think we've chewed that way too much. I am all for removing bw on
Analyzers, and 2396 is a great step towards it (or maybe it is IT?). Even
index format - I don't see when it will change next (but I think I have an
idea ...), so we can tackle it then.

Shai

On Thu, Apr 15, 2010 at 11:33 PM, Robert Muir rcm...@gmail.com wrote:



 On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote:

 Actually, I'd like to know if people like Robert (basically those who have
 no problem to reindex and don't understand the fuss around it) will want to
 change the index format - can I count on them to be asked to provide such
 tool? That's to me a policy we should decide on ... whatever the
 consequences.


 just look at the 1.8MB of backwards compat code in contrib/analyzers i want
 to remove in LUCENE-2396?
 are you serious? I wrote most of that cruft to prevent reindexing and you
 are trying to say I don't understand the fuss about it?

 We shouldn't make people reindex, but we should have the chance, even if we
 only do it ONE TIME, to reset Lucene to a new Major Version that has a
 bunch of stuff fixed we couldn't fix before, and more flexibility.

 because with the current policy, it's like we are in 1.x forever; our
 version numbers are a joke!
 --
 Robert Muir
 rcm...@gmail.com



Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
By all means Robert ... by all means :). Remember who started that thread,
and for what reason :D.

Shai

On Fri, Apr 16, 2010 at 12:01 AM, Robert Muir rcm...@gmail.com wrote:

 If you really believe this. then you have no problem if i remove all
 Version from all core and contrib analyzers right now.

 On Thu, Apr 15, 2010 at 4:50 PM, Shai Erera ser...@gmail.com wrote:

 Robert ... I'm sorry but changes to Analyzers don't *force* people to
 reindex. They can simply choose not to use the latest version. They can
 choose not to upgrade a Unicode version. They can copy the entire Analyzer
 code to match their needs. Index format changes is what I'm worried about
 because that *forces* people to reindex.

 Analyzers, believe it or not, are just a tool, an out of the box tool
 even, we're giving users to analyze their stuff. Probably a tool used by
 most of our users, but not all. Some have their own tools, that are
 currently wrapped as a Lucene Analyzer just because the API mandates. But we
 were talking about that too recently no? Ripping Analyzer off IndexWriter?

 Just to be clear - I think your work on Analyzers is fantastic ! Really !
 Seriously !
 But it's a choice someone can make ... whereas index format is a given -
 you have to live with it, or never upgrade Lucene.

 But I think we've chewed that way too much. I am all for removing bw on
 Analyzers, and 2396 is a great step towards it (or maybe it is IT?). Even
 index format - I don't see when it will change next (but I think I have an
 idea ...), so we can tackle it then.

 Shai


 On Thu, Apr 15, 2010 at 11:33 PM, Robert Muir rcm...@gmail.com wrote:



 On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote:

 Actually, I'd like to know if people like Robert (basically those who
 have no problem to reindex and don't understand the fuss around it) will
 want to change the index format - can I count on them to be asked to 
 provide
 such tool? That's to me a policy we should decide on ... whatever the
 consequences.


 just look at the 1.8MB of backwards compat code in contrib/analyzers i
 want to remove in LUCENE-2396?
 are you serious? I wrote most of that cruft to prevent reindexing and you
 are trying to say I don't understand the fuss about it?

 We shouldnt make people reindex, but we should have the chance, even if
 we only do it ONE TIME, to reset Lucene to a new Major Version that has a
 bunch of stuff fixed we couldnt fix before, and more flexibility.

 because with the current policy, its like we are in 1.x forever our
 version numbers are a joke!
 --
 Robert Muir
 rcm...@gmail.com





 --
 Robert Muir
 rcm...@gmail.com



Re: Proposal about Version API relaxation

2010-04-15 Thread Shai Erera
DM I think ICU is great. But currently we use JFlex and you can run Java 10
if you want, but as long as JFlex is compiled w/ Java 1.4, that's what
you'll get. Luckily Uwe and Robert recently bumped it up to Java 1.5. Such a
change should be clearly documented in CHANGES so people are aware of this,
and at least until they figure out what they want to do with it, they should
take the pre-3.1 analyzers (assuming that's the next release w/ JFlex 1.5
tokenizers) and use them.

Alternatively, we can think of writing an ICU analyzer/tokenizer, but we're
still using JFlex, so I don't know how much control we have on that ...

Shai

On Fri, Apr 16, 2010 at 12:21 AM, DM Smith dmsmith...@gmail.com wrote:


 On Apr 15, 2010, at 4:50 PM, Shai Erera wrote:

  Robert ... I'm sorry but changes to Analyzers don't *force* people to
 reindex. They can simply choose not to use the latest version. They can
 choose not to upgrade a Unicode version. They can copy the entire Analyzer
 code to match their needs. Index format changes is what I'm worried about
 because that *forces* people to reindex.

 In several threads and issues it has been pointed out that upgrading
 Unicode versions is not an obvious choice or even controllable. It is
 dictated by the version of Java, the version of the OS and any Unicode
 specific libraries.

 A desktop application which internally uses lucene has no control over the
 automatic update of Java (yes it can detect the version change and refuse to
 run or force an upgrade) or when the user feels like upgrading the OS (not
 sure how to detect the Unicode version of an arbitrary OS. Not sure I want
 to).

 Even with server applications, some shared servers have one version of Java
 that all use. And the owner of an individual application might have no say
 in if or when that is upgraded.

 This is to say that one needs to be ready to re-index at all times unless
 it can be controlled.

 One way to handle the Java/Unicode is to use ICU at a specific version and
 control its upgrade.

 One way to handle the OS problem (which really is one of user input) is to
 keep up with the changes to Unicode and create a filter that handles the
 differences normalizing to the Unicode version of the index (if that's even
 possible).

 Still goes to your point. The onus is on the application not on Lucene.

 -- DM
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Created: (LUCENE-2397) SnapshotDeletionPolicy.snapshot() throws NPE if no commits happened

2010-04-15 Thread Shai Erera (JIRA)
SnapshotDeletionPolicy.snapshot() throws NPE if no commits happened
---

 Key: LUCENE-2397
 URL: https://issues.apache.org/jira/browse/LUCENE-2397
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1


SDP throws NPE if no commits occurred and snapshot() was called. I will replace 
it w/ throwing IllegalStateException. I'll also move TestSDP from o.a.l to 
o.a.l.index. I'll post a patch soon.
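The proposed fix can be illustrated with a minimal, self-contained sketch (the class and field names below are illustrative stand-ins, not the actual SnapshotDeletionPolicy internals): guard snapshot() so it fails with a descriptive IllegalStateException instead of an NPE when no commit has been recorded yet.

```java
import java.util.List;

class SnapshotPolicySketch {
    private List<String> lastCommitFiles; // stays null until the first commit

    // Stand-in for the deletion-policy hook invoked on each commit.
    void onCommit(List<String> files) {
        lastCommitFiles = files;
    }

    // Fails loudly with a descriptive message instead of dereferencing
    // a null commit.
    List<String> snapshot() {
        if (lastCommitFiles == null) {
            throw new IllegalStateException(
                "no index commit to snapshot; call IndexWriter.commit() first");
        }
        return lastCommitFiles;
    }
}
```

This matches the behavior discussed on the list: callers who take snapshots in a separate code path from indexing get a clear error rather than an NPE.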




Re: Build failed in Hudson: Lucene-trunk #1157

2010-04-15 Thread Shai Erera
DB jars again ... I think this one is a false alarm.

Shai

On Fri, Apr 16, 2010 at 5:14 AM, Apache Hudson Server 
hud...@hudson.zones.apache.org wrote:

 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1157/changes

 Changes:

 [mikemccand] speed up TestStressIndexing2

 --
 [...truncated 4473 lines...]

 jflex-notice:

 javacc-uptodate-check:

 javacc-notice:

 init:

 clover.setup:

 clover.info:

 clover:

 common.compile-core:

 compile-core:

 compile-demo:
[mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo
 
[javac] Compiling 17 source files to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo
 
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

 compile-memory:
 [echo] Building memory...

 common.init:

 build-lucene:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:
[mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/classes/java
 
[javac] Compiling 1 source file to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/classes/java
 
[javac] Note: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java
 uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

 jar-core:
  [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/lucene-memory-2010-04-16_02-03-48.jar
 

 default:

 compile-highlighter:
 [echo] Building highlighter...

 build-memory:

 build-queries:
 [echo] Highlighter building dependency contrib/queries
 [echo] Building queries...

 common.init:

 build-lucene:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:
[mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/classes/java
 
[javac] Compiling 18 source files to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/classes/java
 
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

 jar-core:
  [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/lucene-queries-2010-04-16_02-03-48.jar
 

 default:

 common.init:

 build-lucene:

 init:

 clover.setup:

 clover.info:

 clover:

 common.compile-core:
[mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java
 
[javac] Compiling 35 source files to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java
 
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

 compile-core:

 jar-core:
  [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/lucene-highlighter-2010-04-16_02-03-48.jar
 

 default:

 compile-analyzers-common:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:
[mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java
 
[javac] Compiling 106 source files to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java
 
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/quality/trec/TrecJudge.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

 jar-core:
  [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/lucene-benchmark-2010-04-16_02-03-48.jar
 

 jar:

 compile-test:
 [echo] Building benchmark...

 common.init:

 compile-demo:

 jflex-uptodate-check:

 jflex-notice:

 javacc-uptodate-check:

 javacc-notice:

 init:

 clover.setup:

 

[jira] Resolved: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-04-14 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2316.


Lucene Fields: [New, Patch Available]  (was: [New])
 Assignee: Shai Erera
   Resolution: Fixed

Committed revision 933879.

 Define clear semantics for Directory.fileLength
 ---

 Key: LUCENE-2316
 URL: https://issues.apache.org/jira/browse/LUCENE-2316
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2316.patch


 On this thread: 
 http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
  it was mentioned that Directory's fileLength behavior is not consistent 
 between Directory implementations if the given file name does not exist. 
 FSDirectory returns a 0 length while RAMDirectory throws FNFE.
 The problem is that the semantics of fileLength() are not defined. As 
 proposed in the thread, we'll define the following semantics:
 * Returns the length of the file denoted by {{name}} if the file exists. The 
 return value may be anything between 0 and Long.MAX_VALUE.
 * Throws FileNotFoundException if the file does not exist. Note that you can 
 call dir.fileExists(name) if you are not sure whether the file exists or not.
 For backwards compatibility we'll create a new method w/ clear semantics. Something like:
 {code}
 /**
  * @deprecated this method will become abstract once #fileLength(name) has
  * been removed.
  */
 public long getFileLength(String name) throws IOException {
   long len = fileLength(name);
   if (len == 0 && !fileExists(name)) {
     throw new FileNotFoundException(name);
   }
   return len;
 }
 {code}
 The first line just calls the current impl. If it throws exception for a 
 non-existing file, we're ok. The second line verifies whether a 0 length is 
 for an existing file or not and throws an exception appropriately.




[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856845#action_12856845
 ] 

Shai Erera commented on LUCENE-2159:


This looks like a nice tool. But all it does is create multiple copies of the 
same segment(s) right? So what exactly do you want to test with it? What 
worries me is that we'll be multiplying the lexicon, posting lists, statistics 
etc., therefore I'm not sure how reliable the tests will be (whatever they 
are), except for measuring things related to large number of segments (like 
merge performance). Am I right?

I also think this class better fits in benchmark rather than misc, as it's 
really for perf. testing/measurements and not a generic utility ... You can 
create a Task out of it, like an ExpandIndexTask, which one can include in an 
algorithm.

 Tool to expand the index for perf/stress testing.
 -

 Key: LUCENE-2159
 URL: https://issues.apache.org/jira/browse/LUCENE-2159
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang
 Attachments: ExpandIndex.java


 Sometimes it is useful to take a small-ish index and expand it into a large 
 index with K segments for perf/stress testing. 
 This tool does that. See attached class.




[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856877#action_12856877
 ] 

Shai Erera commented on LUCENE-2159:


bq. I understand having a general performance suite to test regression is a 
good thing. But we found having a more focused test for segmentation and merge 
is important.

Are you saying that because of the benchmark proposal? I still think that an 
ExpandIndexTask will be useful for benchmark and fits better there, than in 
contrib/misc. We can have that task together w/ a predefined .alg for using it 
...

 Tool to expand the index for perf/stress testing.
 -

 Key: LUCENE-2159
 URL: https://issues.apache.org/jira/browse/LUCENE-2159
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang
 Attachments: ExpandIndex.java


 Sometimes it is useful to take a small-ish index and expand it into a large 
 index with K segments for perf/stress testing. 
 This tool does that. See attached class.




[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856911#action_12856911
 ] 

Shai Erera commented on LUCENE-2159:


Which is fine - I think this would be a neat task to add to benchmark, w/ 
specific documentation on how to use it and for what purposes. If you can also 
write a sample .alg file which e.g. creates a small index and then expands it, 
that'd be great.

I've looked at the different PerfTask implementations in benchmark, and I'm 
thinking if we perhaps should do the following:
* Create an AddIndexesTask which receives one or more Directories as input and 
calls writer.addIndexesNoOptimize
* If one wants, he can add an OptimizeTask call afterwards.
* Write an expandIndex.alg which initially creates an index of size N from one 
content source and then calls the AddIndexesTask several times. The .alg file 
is meant to be an example as well as people can change it to create bigger or 
smaller indexes, use other content sources and switch between RAM/FS 
directories.

How's that sound?
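As a rough illustration of the proposed task (PerfTaskSketch and Writer below are simplified stand-ins for benchmark's PerfTask and IndexWriter, not the real classes), doLogic() would boil down to a single addIndexesNoOptimize call over the configured input directories:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for benchmark's PerfTask: doLogic() does the work and
// returns the number of work items performed.
abstract class PerfTaskSketch {
    abstract int doLogic() throws Exception;
}

class AddIndexesTaskSketch extends PerfTaskSketch {
    // Stand-in for IndexWriter, exposing only the call this task needs.
    interface Writer {
        void addIndexesNoOptimize(List<String> dirs);
    }

    private final Writer writer;
    private final List<String> inputDirs;

    AddIndexesTaskSketch(Writer writer, List<String> inputDirs) {
        this.writer = writer;
        this.inputDirs = new ArrayList<>(inputDirs);
    }

    @Override
    int doLogic() {
        // The whole task is one addIndexesNoOptimize call over the
        // configured input directories, as proposed above.
        writer.addIndexesNoOptimize(inputDirs);
        return inputDirs.size();
    }
}
```

An OptimizeTask (which already exists in benchmark) could then follow in the .alg if a fully merged index is wanted.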

 Tool to expand the index for perf/stress testing.
 -

 Key: LUCENE-2159
 URL: https://issues.apache.org/jira/browse/LUCENE-2159
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang
 Attachments: ExpandIndex.java


 Sometimes it is useful to take a small-ish index and expand it into a large 
 index with K segments for perf/stress testing. 
 This tool does that. See attached class.




[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856917#action_12856917
 ] 

Shai Erera commented on LUCENE-2159:


bq. There is an excellent section on it in LIA2

Indeed !

Ok so to create a task, you just extend PerfTask. You can look under 
contrib/benchmark/src/java/o.a.l/benchmark/byTask/tasks for many examples. 
OptimizeTask seems relevant here (i.e. it calls an IW API and receives a 
parameter).

For writing .alg files, that's SUPER simple, just look under 
contrib/benchmark/conf for many existing examples. You can post a patch once 
you feel comfortable enough with it and I can help you with the struggles (if 
you'll run into any). Another great source (besides LIA2) on writing .alg files 
is the package.html under 
contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask.

 Tool to expand the index for perf/stress testing.
 -

 Key: LUCENE-2159
 URL: https://issues.apache.org/jira/browse/LUCENE-2159
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang
 Attachments: ExpandIndex.java


 Sometimes it is useful to take a small-ish index and expand it into a large 
 index with K segments for perf/stress testing. 
 This tool does that. See attached class.







Re: Proposal about Version API relaxation

2010-04-14 Thread Shai Erera
Ahh ... a dream finally comes true ... what a great way to start a day :).
+1 !!!

I have some questions/comments though:

* Index back compat should be maintained between major releases, like it is
today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their
segments when they move from 2.x to 3.x before 4.0 lands and they'll need to
call optimize() to ensure 4.0 still works on their index. I hope that will
still be the case? Otherwise I don't see how we can prevent reindexing by
apps.
** Index behavioral/runtime changes, like those of Analyzers, are ok to
require a reindex, as proposed.

So after 3.1 is out, trunk can break the API and 3.2 will have a new set of
API? Cool and convenient. For how long do we keep the 3.1 branch around?
Also, it used to only fix bugs, but from now on it'll be allowed to
introduce new features, if they maintain back-compat? So 3.1.1 can have
'flex' (going for the extreme on purpose) if someone maintains back-compat?

I think the back-compat on branches should be only for index runtime
changes. There's no point, in my opinion, to maintain API back-compat
anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1
just to get a new feature but get it API back-supported? As soon as they
upgrade to 3.2, that means a new set of API right?

Major releases will just change the index structure format then? Or move to
Java 1.6? Well ... not even that because as I understand it, 3.2 can move to
Java 1.6 ... no API back-compat right :).

That's definitely a great step forward !

Shai

On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote:


 On Thu, 15 Apr 2010, Earwin Burrfoot wrote:

  Can't believe my eyes.

 +1


 Likewise. +1 !

 Andi..


 On Thu, Apr 15, 2010 at 01:22, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
 mar...@rectangular.com wrote:

  Essentially, we're free to break back compat within Lucy at any time,
 but
 we're not able to break back compat within a stable fork like Lucy1,
 Lucy2, etc.  So what we'll probably do during normal development with
 Analyzers is just change them and note the break in the Changes file.


 So... what if we change up how we develop and release Lucene:

  * A major release always bumps the major release number (2.x -
3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
releases along that branch

  * There is no back compat across major releases (index nor APIs),
but full back compat within branches.

 This would match how many other projects work (KS/Lucy, as Marvin
 describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

 The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
 if any devs have the itch, they could freely back-port improvements
 from trunk as long as they kept back-compat within the branch.

 I think in such a future world, we could:

  * Remove Version entirely!

  * Not worry at all about back-compat when developing on trunk

  * Give proper names to new improved classes instead of
StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
today; rename existing classes.

  * Let analyzers freely, incrementally improve

  * Use interfaces without fear

  * Stop spending the truly substantial time (look @ Uwe's awesome
back-compat layer for analyzers!) that we now must spend when
adding new features, for back-compat

  * Be more free to introduce very new not-fully-baked features/APIs,
marked as experimental, on the expectation that once they are used
(in trunk) they will iterate/change/improve vs trying so hard to
get things right on the first go for fear of future back compat
horrors.

 Thoughts...?

 Mike






 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)

 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785








Re: Proposal about Version API relaxation

2010-04-14 Thread Shai Erera
Also, we will still need to maintain the Backwards section in CHANGES (or
move it to API Changes), to help people upgrade from release to release.
Just pointing that out as well.

Shai

On Thu, Apr 15, 2010 at 7:05 AM, Shai Erera ser...@gmail.com wrote:

 Ahh ... a dream finally comes true ... what a great way to start a day :).
 +1 !!!

 I have some questions/comments though:

 * Index back compat should be maintained between major releases, like it is
 today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their
 segments when they move from 2.x to 3.x before 4.0 lands and they'll need to
 call optimize() to ensure 4.0 still works on their index. I hope that will
 still be the case? Otherwise I don't see how we can prevent reindexing by
 apps.
 ** Index behavioral/runtime changes, like those of Analyzers, are ok to
 require a reindex, as proposed.

 So after 3.1 is out, trunk can break the API and 3.2 will have a new set of
 API? Cool and convenient. For how long do we keep the 3.1 branch around?
 Also, it used to only fix bugs, but from now on it'll be allowed to
 introduce new features, if they maintain back-compat? So 3.1.1 can have
 'flex' (going for the extreme on purpose) if someone maintains back-compat?

 I think the back-compat on branches should be only for index runtime
 changes. There's no point, in my opinion, to maintain API back-compat
 anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1
 just to get a new feature but get it API back-supported? As soon as they
 upgrade to 3.2, that means a new set of API right?

 Major releases will just change the index structure format then? Or move to
 Java 1.6? Well ... not even that because as I understand it, 3.2 can move to
 Java 1.6 ... no API back-compat right :).

 That's definitely a great step forward !

 Shai


 On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.orgwrote:


 On Thu, 15 Apr 2010, Earwin Burrfoot wrote:

  Can't believe my eyes.

 +1


 Likewise. +1 !

 Andi..


 On Thu, Apr 15, 2010 at 01:22, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
 mar...@rectangular.com wrote:

  Essentially, we're free to break back compat within Lucy at any time,
 but
 we're not able to break back compat within a stable fork like Lucy1,
 Lucy2, etc.  So what we'll probably do during normal development with
 Analyzers is just change them and note the break in the Changes file.


 So... what if we change up how we develop and release Lucene:

  * A major release always bumps the major release number (2.x -
3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
releases along that branch

  * There is no back compat across major releases (index nor APIs),
but full back compat within branches.

 This would match how many other projects work (KS/Lucy, as Marvin
 describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

 The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
 if any devs have the itch, they could freely back-port improvements
 from trunk as long as they kept back-compat within the branch.

 I think in such a future world, we could:

  * Remove Version entirely!

  * Not worry at all about back-compat when developing on trunk

  * Give proper names to new improved classes instead of
StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
today; rename existing classes.

  * Let analyzers freely, incrementally improve

  * Use interfaces without fear

  * Stop spending the truly substantial time (look @ Uwe's awesome
back-compat layer for analyzers!) that we now must spend when
adding new features, for back-compat

  * Be more free to introduce very new not-fully-baked features/APIs,
marked as experimental, on the expectation that once they are used
(in trunk) they will iterate/change/improve vs trying so hard to
get things right on the first go for fear of future back compat
horrors.

 Thoughts...?

 Mike






 --
  Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)

 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785










Re: Proposal about Version API relaxation

2010-04-14 Thread Shai Erera
So then I don't understand this:

{quote}
* A major release always bumps the major release number (2.x -
   3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
   releases along that branch

* There is no back compat across major releases (index nor APIs),
   but full back compat within branches.

{quote}

What's different than what's done today? How can we remove Version in that
world, if we need to maintain full back-compat between 3.1 and 3.2, index
and API-wise? We'll still need to deprecate and come up w/ new classes every
time, and we'll still need to maintain runtime changes back-compat.

Unless you're telling me we'll start releasing major releases more often?
Well ... then we're saying the same thing, only I think that instead of
releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ...
because if you look back, every minor release included API deprecations as
well as back-compat breaks. That means that every minor release should have
been a major release right?

Point is, if I understand correctly and you agree w/ my statement above - I
don't see why would anyone releases a 3.x after 4.0 is out unless someone
really wants to work hard on maintaining back-compat of some features.

If it's just a numbering thing, then I don't think it matters what is
defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor
X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer
the latter but don't have any strong feelings against the former. Just
pointing out that X will grow more rapidly than today. That's all.

So did I get it right?

Shai

On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote:

 I don't read what you wrote and what Mike wrote as even close to the same.

 - Mark

 http://www.lucidimagination.com (mobile)

 On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote:

 Ahh ... a dream finally comes true ... what a great way to start a day :).
 +1 !!!

 I have some questions/comments though:

 * Index back compat should be maintained between major releases, like it is
 today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their
 segments when they move from 2.x to 3.x before 4.0 lands and they'll need to
 call optimize() to ensure 4.0 still works on their index. I hope that will
 still be the case? Otherwise I don't see how we can prevent reindexing by
 apps.
 ** Index behavioral/runtime changes, like those of Analyzers, are ok to
 require a reindex, as proposed.

 So after 3.1 is out, trunk can break the API and 3.2 will have a new set of
 API? Cool and convenient. For how long do we keep the 3.1 branch around?
 Also, it used to only fix bugs, but from now on it'll be allowed to
 introduce new features, if they maintain back-compat? So 3.1.1 can have
 'flex' (going for the extreme on purpose) if someone maintains back-compat?

 I think the back-compat on branches should be only for index runtime
 changes. There's no point, in my opinion, to maintain API back-compat
 anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1
 just to get a new feature but get it API back-supported? As soon as they
 upgrade to 3.2, that means a new set of API right?

 Major releases will just change the index structure format then? Or move to
 Java 1.6? Well ... not even that because as I understand it, 3.2 can move to
 Java 1.6 ... no API back-compat right :).

 That's definitely a great step forward !

 Shai

 On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda  va...@osafoundation.org
 va...@osafoundation.org wrote:


 On Thu, 15 Apr 2010, Earwin Burrfoot wrote:

  Can't believe my eyes.

 +1


 Likewise. +1 !

 Andi..


 On Thu, Apr 15, 2010 at 01:22, Michael McCandless
  luc...@mikemccandless.comluc...@mikemccandless.com wrote:

 On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
  mar...@rectangular.commar...@rectangular.com wrote:

  Essentially, we're free to break back compat within Lucy at any time,
 but
 we're not able to break back compat within a stable fork like Lucy1,
 Lucy2, etc.  So what we'll probably do during normal development with
 Analyzers is just change them and note the break in the Changes file.


 So... what if we change up how we develop and release Lucene:

  * A major release always bumps the major release number (2.x -
3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
releases along that branch

  * There is no back compat across major releases (index nor APIs),
but full back compat within branches.

 This would match how many other projects work (KS/Lucy, as Marvin
 describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

 The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
 if any devs have the itch, they could freely back-port improvements
 from trunk as long as they kept back-compat within the branch.

 I think in such a future world, we could:

  * Remove Version entirely!

  * Not worry at all about back-compat when

[jira] Resolved: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2386.


Resolution: Fixed

Committed revision 933613. (take #2)

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).







Proposal about Version API relaxation

2010-04-13 Thread Shai Erera
Hi

I'd like to propose a relaxation on the Version API. Uwe, please read the
entire email before you reply :).

I was thinking, following a question on the user list, that the
Version-based API may not be very intuitive to users, especially those who
don't care about versioning, as well as very inconvenient. So there are two
issues here:
1) How should one use Version smartly so that he keeps backwards
compatibility. I think we all know the answer, but a Wiki page with some
best practices tips would really help users use it.
2) How can one write sane code, which doesn't pass versions all over the
place if: (1) he doesn't care about versions, or (2) he cares, and sets the
Version to the same value in his app, in all places.

Also, I think that today we offer a flexibility to users, to set different
Versions on different objects in the life span of their application - which
is a good flexibility but can also lead people to shoot themselves in the
legs if they're not careful -- e.g. upgrading Version across their app, but
failing to do so for one or two places ...

So the change I'd like to propose is to mostly alleviate (2) and better
protect users - I DO NOT PROPOSE TO GET RID OF Version :).

I was thinking that we can add on Version a DEFAULT version, which the
caller can set. So Version.setDefault and Version.getDefault will be added,
as static members (more on the static-ness of it later). We then change the
API which requires Version to also expose an API which doesn't require it,
and that API will call Version.getDefault(). People can use it if they want
to ...

Few points:
1) As a default DEFAULT Version is controversial, I don't want to propose
it, even though I think Lucene can define the DEFAULT to be the latest.
Instead, I propose that Version.getDefault throw a
DefaultVersionNotSetException if it wasn't set, while an API which relies on
the default Version is called (I don't want to return null, not sure how
safe it is).
2) That DEFAULT Version is static, which means it will affect all indexing
code running inside the JVM. Which is fine:
2.1) Perhaps all the indexing code should use the same Version
2.2) If you know that's not the case, then pass Version to the API which
requires it - you cannot use the 'default Version' API -- nothing changes
for you.
One case is missing -- you might not know if your code is the only indexing
code which runs in the JVM ... I don't have a solution to that, but I think
it'll be revealed pretty quickly, and you can change your code then ...

So to summarize - the current Version API will remain and people can still
use it. The DEFAULT Version API is meant for convenience for those who don't
want to pass Version everywhere, for the reasons I outlined above. This will
also clean our test code significantly, as the tests will set the DEFAULT
version to TEST_VERSION_CURRENT at start ...

The changes to the Version class will be very simple.
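To make the proposal concrete, here is a minimal, self-contained sketch of the two static members. Everything here is illustrative, not actual Lucene code: the enum values are stand-ins, and IllegalStateException stands in for the proposed DefaultVersionNotSetException.

```java
// Hypothetical sketch of the proposed default-Version API -- not Lucene code.
enum Version {
  LUCENE_29, LUCENE_30, LUCENE_31;

  private static volatile Version defaultVersion; // no default until the app sets one

  static void setDefault(Version v) { defaultVersion = v; }

  static Version getDefault() {
    Version v = defaultVersion;
    if (v == null) {
      // the proposal: fail loudly instead of silently picking a version
      throw new IllegalStateException("default Version not set");
    }
    return v;
  }
}

class VersionDefaultSketch {
  public static void main(String[] args) {
    boolean threw = false;
    try { Version.getDefault(); } catch (IllegalStateException e) { threw = true; }
    System.out.println("before setDefault -> exception: " + threw);
    Version.setDefault(Version.LUCENE_31);
    System.out.println("after setDefault -> " + Version.getDefault());
  }
}
```

Apps that don't want a process-wide default simply never call setDefault and keep passing Version explicitly, exactly as today.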

If people think that's acceptable, I can open an issue and work on it.

Shai


[jira] Updated: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-04-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2316:
---

Attachment: LUCENE-2316.patch

Patch clarifies the contract, fixes the directories to adhere to it, and adds a 
CHANGES entry under the backwards section. All tests pass.

 Define clear semantics for Directory.fileLength
 ---

 Key: LUCENE-2316
 URL: https://issues.apache.org/jira/browse/LUCENE-2316
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2316.patch


 On this thread: 
 http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
  it was mentioned that Directory's fileLength behavior is not consistent 
 between Directory implementations if the given file name does not exist. 
 FSDirectory returns a 0 length while RAMDirectory throws FNFE.
 The problem is that the semantics of fileLength() are not defined. As 
 proposed in the thread, we'll define the following semantics:
 * Returns the length of the file denoted by <code>name</code> if the file 
 exists. The return value may be anything between 0 and Long.MAX_VALUE.
 * Throws FileNotFoundException if the file does not exist. Note that you can 
 call dir.fileExists(name) if you are not sure whether the file exists or not.
 For backwards compatibility we'll create a new method w/ clear semantics. Something like:
 {code}
 /**
  * @deprecated the method will become abstract when #fileLength(name) has 
  * been removed.
  */
 public long getFileLength(String name) throws IOException {
   long len = fileLength(name);
   if (len == 0 && !fileExists(name)) {
     throw new FileNotFoundException(name);
   }
   return len;
 }
 {code}
 The first line just calls the current impl. If it throws an exception for a 
 non-existing file, we're ok. The second line verifies whether a 0 length is 
 for an existing file or not and throws an exception appropriately.
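The proposed semantics can be demonstrated with a tiny, self-contained in-memory directory. ToyDirectory and its method names are stand-ins for illustration only, not Lucene's Directory API.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Toy in-memory directory illustrating the proposed fileLength contract:
// return the length when the file exists, throw FileNotFoundException otherwise.
class ToyDirectory {
  private final Map<String, byte[]> files = new HashMap<String, byte[]>();

  void createFile(String name, byte[] content) { files.put(name, content); }

  boolean fileExists(String name) { return files.containsKey(name); }

  long fileLength(String name) throws FileNotFoundException {
    byte[] content = files.get(name);
    if (content == null) {
      // consistent across implementations, unlike the old FSDirectory/RAMDirectory split
      throw new FileNotFoundException(name);
    }
    return content.length;
  }
}
```

With this contract a 0 return value always means "an existing, empty file", so callers no longer need to disambiguate.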







Re: Proposal about Version API relaxation

2010-04-13 Thread Shai Erera
Well the no-arg ctor will be using Version.getDefault() which will
throw an exception if not set, and delegate the call to the
Version-aware ctor.
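A minimal sketch of that delegation, using a stand-in class (the class name, nested Version enum, and method names are all illustrative, not Lucene API):

```java
// Hypothetical sketch: a no-arg ctor delegating to the Version-aware one.
class AnalyzerSketch {
  enum Version { LUCENE_30, LUCENE_31 }

  private static volatile Version defaultVersion; // app-wide default, if set
  static void setDefaultVersion(Version v) { defaultVersion = v; }

  private final Version matchVersion;

  // explicit, always-available ctor
  AnalyzerSketch(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  // convenience ctor: picks up the app-wide default, or fails loudly
  AnalyzerSketch() {
    this(requireDefault());
  }

  private static Version requireDefault() {
    Version v = defaultVersion;
    if (v == null) throw new IllegalStateException("default Version not set");
    return v;
  }

  Version getMatchVersion() { return matchVersion; }
}
```

So the no-arg ctor's behavior is well-defined: it either uses exactly what the app configured once, or throws, never silently emulating "some random behavior".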

On Tuesday, April 13, 2010, Robert Muir rcm...@gmail.com wrote:
 On Tue, Apr 13, 2010 at 11:27 AM, Shai Erera ser...@gmail.com wrote:


 I was thinking that we can add on Version a DEFAULT version, which the caller 
 can set. So Version.setDefault and Version.getDefault will be added, as 
 static members (more on the static-ness of it later). We then change the API 
 which requires Version to also expose an API which doesn't require it, and 
 that API will call Version.getDefault(). People can use it if they want to ...

 I don't understand how this works... if Something has a no-arg ctor today, 
 and i want to improve it in a backwards-compatible way, how will this work?
 the way this works today, lets say while working with 3.1 is:

 Something() is deprecated, and invokes Something(3.0).
 Something(Version) is added, and emulates the old behavior for < 3.1, and the 
 new behavior for >= 3.1.
 I don't see how backwards compatibility will work with this proposal, since 
 the no-arg ctor would then emulate some random behavior depending on a static.


 --
 Robert Muir
 rcm...@gmail.com





Re: Proposal about Version API relaxation

2010-04-13 Thread Shai Erera
 That is a static default!

Yes Uwe ... I'm aware of that :)
But that's not a static default for Lucene ... only for the application, if
it chooses to use it ...

 so there are no plans to reimplement such a thing again

Well ... that's not exactly what I'm proposing here. I'm not for
re-implementing any sort of staticness, unless the app chooses to use it.
And please don't give me that 'there are no plans ...' answer - it kind of
kills the discussion, which is not healthy for a community.

I agree that static variables might cause troubles to some deployments, BUT:

1) Not all apps are deployed on a Web Server together with other apps who
happen to use Lucene.
2) Those that are deployed on web servers usually include lucene.jar in
their classpath and are loaded by a different class loader than the rest ...

So we're really talking about deployments where Lucene is a common, shared
library between all apps ...

And I guess that what bothers me the most is that it feels to me like we're
trying to protect people from stuff we haven't yet received complaints on
(at least none that I'm aware of), while we're hurting the programming
experience of others ... almost recklessly. I'd hope we can find a way
around that, because today I pass the same Version value around everywhere,
and it's simply inconvenient. Just yesterday people complained about the
need to call writer.commit() after new IW() if they want to open a reader
... one-liner inconvenience vs. dozen of lines here -- point is, what's
perceived as unnecessary code DOES bother people ... only here it's just a
setting thing, and my proposal is not to make it generically static. So
let's not get caught on that 'static-ness'. And besides, if you ask me
- variables
like Version, that are needed in so many places, are usually made static ...
but not in Lucene ...

So if possible ... I'd like to think how we can fix/improve the use of
Version, in ways that won't break apps. Because the fact to the matter is -
we invented Version to allow for changes w/o breaking back-compat, while the
backwards section in CHANGES seems to grow from release to release (I know -
I'm partly to blame for it :)), and another fact is that I don't remember
even one complaint about a change which broke back-compat. People have
raised this issue numerous times in the past, even proposed all sorts of
contracts and definitions on how we can be 'allowed' to break back-compat
... but nothing came out of it.

The fact that we are not able to reach consensus doesn't mean the problem
doesn't bother many out there. And ignoring the fact that currently the API
looks cluttered is not doing any good. There must be away to allow some apps
out there (IMO the majority) to set that Version thing once, and let Lucene
use that value everywhere else ... while for others to pass it along as much
as they want.

Shai

On Tue, Apr 13, 2010 at 7:41 PM, Uwe Schindler u...@thetaphi.de wrote:

  Hi Shai,



  one of the problems I have is: that is a static default! We want to get rid
 of them (and did it mostly, only some relics remain), so there are no plans
 to reimplement such a thing again. The worst one is
 BooleanQuery.maxClauseCount. The same applies to all types of sysprops. As
 Lucene and solr is mostly running in servlet containers, this type of thing
 makes web applications no longer isolated. This is also a general contract
 for libraries: never ever rely on sysprops or statics.



 Uwe



 -

 Uwe Schindler

 H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de

 eMail: u...@thetaphi.de



 *From:* Shai Erera [mailto:ser...@gmail.com]
 *Sent:* Tuesday, April 13, 2010 5:27 PM
 *To:* java-dev@lucene.apache.org
 *Subject:* Proposal about Version API relaxation



 Hi

 I'd like to propose a relaxation on the Version API. Uwe, please read the
 entire email before you reply :).

 I was thinking, following a question on the user list, that the
 Version-based API may not be very intuitive to users, especially those who
 don't care about versioning, as well as very inconvenient. So there are two
 issues here:
 1) How should one use Version smartly so that he keeps backwards
 compatibility. I think we all know the answer, but a Wiki page with some
 best practices tips would really help users use it.
 2) How can one write sane code, which doesn't pass versions all over the
 place if: (1) he doesn't care about versions, or (2) he cares, and sets the
 Version to the same value in his app, in all places.

 Also, I think that today we offer a flexibility to users, to set different
 Versions on different objects in the life span of their application - which
 is a good flexibility but can also lead people to shoot themselves in the
 legs if they're not careful -- e.g. upgrading Version across their app, but
 failing to do so for one or two places ...

 So the change I'd like to propose is to mostly alleviate (2) and better
 protect users - I DO NOT PROPOSE TO GET RID OF Version :).

 I was thinking

Re: Proposal about Version API relaxation

2010-04-13 Thread Shai Erera
 Because the version mechanism is not a single value for the entire library
but rather feature by feature. I don't see how a global setter can help.

That's only true if we believe people use different Version values in
different places of their code ... and note that they will still be able to.
I'm not proposing to take out Version from the ctors, just to add an
additional default-version the app can set and use. So if the app doesn't
want to use it ... it doesn't have to.

Shai

On Tue, Apr 13, 2010 at 9:40 PM, DM Smith dmsmith...@gmail.com wrote:

 I like the concept of version, but I'm concerned about it too.

 The current Version mechanism allows one to use more than one Version in
 their code. Imagine that we are at 3.2 and one was unable to upgrade to the
 most recent version for a particular feature. Let's also suppose that at 3.2 a new
 feature was introduced and was taken advantage of. But at 3.5 that new
 feature is versioned but one is unable to upgrade for it, too. Now what? Use
 3.0 for the one feature and 3.2 for the other?

 What about the interoperability of versioned features? Does a version 3.0
 class play well with a 3.2 versioned class? How do we test that?

 A long term issue is that of bw compat for the version itself. The bw
 compat contract is two fold: API and index. The API has a shorter lifetime
 of compatibility than that of an index. How does one deprecate a particular
 version for the api but not the index? How does one know whether one
 versioned feature impacts the index and an other does not?

 I'm hoping that I'm imagining a problem that will never actually arise.

 Shai, to your suggestion: Because the version mechanism is not a single
 value for the entire library but rather feature by feature. I don't see how
 a global setter can help.

 -- DM


 On 04/13/2010 11:27 AM, Shai Erera wrote:

 Hi

 I'd like to propose a relaxation on the Version API. Uwe, please read the
 entire email before you reply :).

 I was thinking, following a question on the user list, that the
 Version-based API may not be very intuitive to users, especially those who
 don't care about versioning, as well as very inconvenient. So there are two
 issues here:
 1) How should one use Version smartly so that he keeps backwards
 compatibility. I think we all know the answer, but a Wiki page with some
 best practices tips would really help users use it.
 2) How can one write sane code, which doesn't pass versions all over the
 place if: (1) he doesn't care about versions, or (2) he cares, and sets the
 Version to the same value in his app, in all places.

 Also, I think that today we offer a flexibility to users, to set different
 Versions on different objects in the life span of their application - which
 is a good flexibility but can also lead people to shoot themselves in the
 foot if they're not careful -- e.g. upgrading Version across their app, but
 failing to do so for one or two places ...

 So the change I'd like to propose is to mostly alleviate (2) and better
 protect users - I DO NOT PROPOSE TO GET RID OF Version :).

 I was thinking that we can add on Version a DEFAULT version, which the
 caller can set. So Version.setDefault and Version.getDefault will be added,
 as static members (more on the static-ness of it later). We then change the
 API which requires Version to also expose an API which doesn't require it,
 and that API will call Version.getDefault(). People can use it if they want
 to ...

 Few points:
 1) As a default DEFAULT Version is controversial, I don't want to propose
 it, even though I think Lucene can define the DEFAULT to be the latest.
 Instead, I propose that Version.getDefault throw a
 DefaultVersionNotSetException if it wasn't set, while an API which relies on
 the default Version is called (I don't want to return null, not sure how
 safe it is).
 2) That DEFAULT Version is static, which means it will affect all indexing
 code running inside the JVM. Which is fine:
 2.1) Perhaps all the indexing code should use the same Version
 2.2) If you know that's not the case, then pass Version to the API which
 requires it - you cannot use the 'default Version' API -- nothing changes
 for you.
 One case is missing -- you might not know if your code is the only
 indexing code which runs in the JVM ... I don't have a solution to that, but
 I think it'll be revealed pretty quickly, and you can change your code then
 ...

 So to summarize - the current Version API will remain and people can still
 use it. The DEFAULT Version API is meant for convenience for those who don't
 want to pass Version everywhere, for the reasons I outlined above. This will
 also clean our test code significantly, as the tests will set the DEFAULT
 version to TEST_VERSION_CURRENT at start ...

 The changes to the Version class will be very simple.

 If people think that's acceptable, I can open an issue and work on it.

 Shai
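The setDefault/getDefault additions proposed above might look roughly like this. This is a hypothetical sketch only -- no such API exists in Lucene, and the enum values below are simplified stand-ins for org.apache.lucene.util.Version:

```java
// Hypothetical sketch of the proposed Version.setDefault/getDefault API.
// The enum values are simplified stand-ins for org.apache.lucene.util.Version.
public class VersionDefaultSketch {

    public enum Version {
        LUCENE_29, LUCENE_30, LUCENE_31;

        private static Version defaultVersion; // null == "not set"

        public static void setDefault(Version v) {
            defaultVersion = v;
        }

        public static Version getDefault() {
            if (defaultVersion == null) {
                // proposed: fail loudly instead of returning null
                throw new IllegalStateException("default Version was not set");
            }
            return defaultVersion;
        }
    }

    public static void main(String[] args) {
        Version.setDefault(Version.LUCENE_31);
        // A Version-less overload (e.g. an analyzer ctor) could then fall back to:
        System.out.println(Version.getDefault());
    }
}
```

An app whose indexing code all runs against a single version would call setDefault once at startup; code that needs a different version keeps passing it explicitly, exactly as today.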




[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855870#action_12855870
 ] 

Shai Erera commented on LUCENE-2386:


I'm not sure if we're arguing about the same thing here ... why, when I open an 
IW on an empty Directory, do I need an empty segment that's created, and from then on 
never changed, populated or even read? That just seems wrong to me ... when I 
fixed the tests to not rely on the buggy behavior, I noticed several which 
count the list of commits (especially the IDP ones) w/ documentation like 1 
for opening + N for committing ...

It just looks weird that when you open IW a commit happens, a set of empty 
files are created, but from now on they are never modified, until IDP kicks in, 
after the second commit ... it's nothing like initing the Directory to be able 
to receive input ..

And I don't know what the benefit is of doing new IW() followed by 
IR.open() ... that IR will always see 0 documents, until you call reopen (if a 
commit happened in between). So what's the convenience here? That your code can 
call IR.open once, and from that point forward just 'reopen()'? That seems like a 
low advantage to me, really. Maybe what we should do is fix IR.open to return a 
null IR in case the directory hasn't been populated w/ anything yet. Then you 
can check easily if you should call open() (==null) or reopen (otherwise). Or 
create a blank stub of IR which emulates an empty Dir, and when reopen is 
called works well (if the Directory is not empty now) ...

BTW, FWIW, Solr's code did not break from this change at all ... it was the 
combination of FSDir and NoLF/SingleInstanceLF that broke some tests that used 
it ... I don't know how many apps out there are using that combination, but I'd 
bet it's small? I use that combination, however in my case an IR is opened only 
after a commit signal/event is raised (so I don't check isCurrent often or 
attempt to reopen()). What I'm trying to say is that this combination is 
dangerous, and the application needs to ensure that only one IW is open at any 
given time, and I'm sure such apps are more sophisticated than opening IW and 
then IR just for the convenience of it.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855873#action_12855873
 ] 

Shai Erera commented on LUCENE-2316:


Well ... dir.fileLength is also used by SegmentInfos.sizeInBytes to compute the 
size of all the files in the Directory. If we remove fileLength, then SI will 
need to call dir.openInput(name).length() and then close it? Seems like a lot of work 
to me, for just obtaining the length of the file. So I agree that if you have 
an IndexInput at hand, you should call its length() method rather than 
Dir.fileLength. But otherwise, if you just have a name at hand, a 
dir.fileLength is convenient?

I'm also ok w/ the bw break rather than going through the new/deprecate cycle.

 Define clear semantics for Directory.fileLength
 ---

 Key: LUCENE-2316
 URL: https://issues.apache.org/jira/browse/LUCENE-2316
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1


 On this thread: 
 http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
  it was mentioned that Directory's fileLength behavior is not consistent 
 between Directory implementations if the given file name does not exist. 
 FSDirectory returns a 0 length while RAMDirectory throws FNFE.
 The problem is that the semantics of fileLength() are not defined. As 
 proposed in the thread, we'll define the following semantics:
 * Returns the length of the file denoted by {{name}} if the file 
 exists. The return value may be anything between 0 and Long.MAX_VALUE.
 * Throws FileNotFoundException if the file does not exist. Note that you can 
 call dir.fileExists(name) if you are not sure whether the file exists or not.
 For backwards we'll create a new method w/ clear semantics. Something like:
 {code}
 /**
  * @deprecated the method will become abstract when #fileLength(name) has 
 been removed.
  */
 public long getFileLength(String name) throws IOException {
   long len = fileLength(name);
   if (len == 0 && !fileExists(name)) {
 throw new FileNotFoundException(name);
   }
   return len;
 }
 {code}
 The first line just calls the current impl. If it throws exception for a 
 non-existing file, we're ok. The second line verifies whether a 0 length is 
 for an existing file or not and throws an exception appropriately.




[jira] Commented: (LUCENE-2392) Enable flexible scoring

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855875#action_12855875
 ] 

Shai Erera commented on LUCENE-2392:


Mike - it'll also be great if we can store the length of the document in a 
custom way. I think what I'm saying is that if we can open up the norms 
computation to custom code - that will do what I want, right? Maybe we can have 
a class like DocLengthProvider which apps can plug in if they want to customize 
how that length is computed. Wherever we write the norms, we'll call that impl, 
which by default will do what Lucene does today?
I think though that it's not a field-level setting, but an IW one?
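A minimal sketch of what such a hook could look like. DocLengthProvider is a made-up name from this comment -- nothing like it exists in Lucene, and the signature is an assumption:

```java
// Hypothetical DocLengthProvider hook; names and signature are invented
// for illustration only -- Lucene has no such class.
public class DocLengthSketch {

    /** Pluggable policy: the app decides what "document length" means. */
    public interface DocLengthProvider {
        float docLength(int uniqueTermCount, int totalTermCount);
    }

    /** Mimics the usual default: length is the total number of tokens in the field. */
    static final DocLengthProvider DEFAULT = (unique, total) -> total;

    /** An alternative some IR models prefer: count only unique terms. */
    static final DocLengthProvider UNIQUE_ONLY = (unique, total) -> unique;

    public static void main(String[] args) {
        // field "a b c a a b" -> 3 unique terms, 6 total terms
        System.out.println(DEFAULT.docLength(3, 6));
        System.out.println(UNIQUE_ONLY.docLength(3, 6));
    }
}
```

Wherever norms are written, the indexer would consult the configured provider instead of a hard-wired token count.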

 Enable flexible scoring
 ---

 Key: LUCENE-2392
 URL: https://issues.apache.org/jira/browse/LUCENE-2392
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2392.patch


 This is a first step (nowhere near committable!), implementing the
 design iterated to in the recent Baby steps towards making Lucene's
 scoring more flexible java-dev thread.
 The idea is (if you turn it on for your Field; it's off by default) to
 store full stats in the index, into a new _X.sts file, per doc (X
 field) in the index.
 And then have FieldSimilarityProvider impls that compute doc's boost
 bytes (norms) from these stats.
 The patch is able to index the stats, merge them when segments are
 merged, and provides an iterator-only API.  It also has starting point
 for per-field Sims that use the stats iterator API to compute boost
 bytes.  But it's not at all tied into actual searching!  There's still
 tons left to do, eg, how does one configure via Field/FieldType which
 stats one wants indexed.
 All tests pass, and I added one new TestStats unit test.
 The stats I record now are:
   - field's boost
   - field's unique term count (a b c a a b -- 3)
   - field's total term count (a b c a a b -- 6)
   - total term count per-term (sum of total term count for all docs
 that have this term)
 Still need at least the total term count for each field.




[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855877#action_12855877
 ] 

Shai Erera commented on LUCENE-2373:


I'd rather not count on file length as well ... so a put/getTermDictSize method 
on Codec will allow one to implement it however one wants, if running on HDFS 
for example?

 Change StandardTermsDictWriter to work with streaming and append-only 
 filesystems
 -

 Key: LUCENE-2373
 URL: https://issues.apache.org/jira/browse/LUCENE-2373
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Andrzej Bialecki 
 Fix For: 3.1


 Since early 2.x times Lucene used a skip/seek/write trick to patch the length 
 of the terms dict into a place near the start of the output data file. This 
 however made it impossible to use Lucene with append-only filesystems such as 
 HDFS.
 In the post-flex trunk the following code in StandardTermsDictWriter 
 initiates this:
 {code}
 // Count indexed fields up front
  CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT);
  out.writeLong(0); // leave space for end index pointer
 {code}
 and completes this in close():
 {code}
   out.seek(CodecUtil.headerLength(CODEC_NAME));
   out.writeLong(dirStart);
 {code}
 I propose to change this layout so that this pointer is stored simply at the 
 end of the file. It's always 8 bytes long, and we know the final length of 
 the file from Directory, so it's a single additional seek(length - 8) to read 
 it, which is not much considering the benefits.
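The proposed trailing-pointer layout can be sketched with plain java.io. RandomAccessFile stands in for Lucene's IndexOutput/IndexInput here; the point is that writing needs only appends, and reading needs one extra seek at open time:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the append-only layout: the directory-start pointer is the
// last 8 bytes of the file, recovered with a single seek(length - 8).
public class TrailingPointerSketch {

    /** Writes a body plus the trailing 8-byte pointer; returns the pointer written. */
    static long write(File f) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(f, "rw")) {
            out.writeBytes("...header and terms data...");  // append-only body
            long dirStart = out.getFilePointer();            // where the term index starts
            out.writeBytes("...term index...");
            out.writeLong(dirStart);                         // pointer stored last
            return dirStart;
        }
    }

    /** Recovers the pointer with a single seek(length - 8) -- no back-patching. */
    static long read(File f) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(in.length() - 8);
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("terms", ".dict");
        f.deleteOnExit();
        long written = write(f);
        System.out.println(read(f) == written); // true
    }
}
```

Because nothing earlier in the file is ever patched, this layout works on append-only filesystems such as HDFS.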




Re: [jira] Commented: (LUCENE-2392) Enable flexible scoring

2010-04-12 Thread Shai Erera
I'm not sure, Robert, where I proposed to shove random statistics into the
index. Lucene computes a doc length today, and some in academia/research
disagree with how it's done. So instead of attempting to fix it for
everyone, I think it'd be great if one could define the doc length as one
perceives it. Why is that problematic?

What Mike opened is an issue titled enable flexible scoring ... what I'm
asking for falls under that umbrella?

Also, maybe we should have that discussion on the issue?

Shai

On Mon, Apr 12, 2010 at 11:31 AM, Robert Muir rcm...@gmail.com wrote:

 I disagree. I think what Mike has defined here is way beyond a baby-step:
 its all the stats needed to support modern IR models in Lucene: BM25,
 additional vector space algorithms, divergence from randomness, and language
 modelling.

 I think the ability to calculate your own random statistics and shove them
 into the index (this would be messy like how to get access to the aggregates
 you need anyway) is something different entirely, best left to research
 systems.

 You can't even do that with Terrier now.

 On Mon, Apr 12, 2010 at 3:35 AM, Shai Erera (JIRA) j...@apache.orgwrote:


[
 https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855875#action_12855875]

 Shai Erera commented on LUCENE-2392:
 

 Mike - it'll also be great if we can store the length of the document in a
 custom way. I think what I'm saying is that if we can open up the norms
 computation to custom code - that will do what I want, right? Maybe we can
 have a class like DocLengthProvider which apps can plug in if they want to
 customize how that length is computed. Wherever we write the norms, we'll
 call that impl, which by default will do what Lucene does today?
 I think though that it's not a field-level setting, but an IW one?

  Enable flexible scoring
  ---
 
  Key: LUCENE-2392
  URL: https://issues.apache.org/jira/browse/LUCENE-2392
  Project: Lucene - Java
   Issue Type: Improvement
   Components: Search
 Reporter: Michael McCandless
 Assignee: Michael McCandless
  Fix For: 3.1
 
  Attachments: LUCENE-2392.patch
 
 
  This is a first step (nowhere near committable!), implementing the
  design iterated to in the recent Baby steps towards making Lucene's
  scoring more flexible java-dev thread.
  The idea is (if you turn it on for your Field; it's off by default) to
  store full stats in the index, into a new _X.sts file, per doc (X
  field) in the index.
  And then have FieldSimilarityProvider impls that compute doc's boost
  bytes (norms) from these stats.
  The patch is able to index the stats, merge them when segments are
  merged, and provides an iterator-only API.  It also has starting point
  for per-field Sims that use the stats iterator API to compute boost
  bytes.  But it's not at all tied into actual searching!  There's still
  tons left to do, eg, how does one configure via Field/FieldType which
  stats one wants indexed.
  All tests pass, and I added one new TestStats unit test.
  The stats I record now are:
- field's boost
- field's unique term count (a b c a a b -- 3)
- field's total term count (a b c a a b -- 6)
- total term count per-term (sum of total term count for all docs
  that have this term)
  Still need at least the total term count for each field.





 --
 Robert Muir
 rcm...@gmail.com



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855892#action_12855892
 ] 

Shai Erera commented on LUCENE-2386:


bq. what is the proper way (after this fix) to open an IR over possibly-empty 
directory? 

You can simply call commit() immediately after you open IW. If that's what you 
need then it will work for you.

You're right that if I add docs and deletes and then commit, I'll get an empty 
segment. The same is true if you do new IW() and then iw.close() w/ no addDocument in 
between. The point here was that we should not create a commit unless the user 
has specifically asked for it. Calling close() means asking for a commit, per 
close() semantics and contract. But if the app called new IW, added docs and 
crashed in the middle, the Directory will still remain empty ... which is, IMO, 
what should happen.

I agree it's a matter of perspective. I think that when autoCommit was removed, 
this code should have been removed as well. I don't know if it was left behind for a 
good reason, or simply because when someone tried to do it, he found out it's not 
that simple (like I have :)).

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
  unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).




[jira] Commented: (LUCENE-2392) Enable flexible scoring

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855913#action_12855913
 ] 

Shai Erera commented on LUCENE-2392:


I'd like to withdraw my request from above. I had missed that the stats I 
need are stored per-field per-doc, so that will allow me to compute the 
docLength as I want.

 Enable flexible scoring
 ---

 Key: LUCENE-2392
 URL: https://issues.apache.org/jira/browse/LUCENE-2392
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2392.patch


 This is a first step (nowhere near committable!), implementing the
 design iterated to in the recent Baby steps towards making Lucene's
 scoring more flexible java-dev thread.
 The idea is (if you turn it on for your Field; it's off by default) to
 store full stats in the index, into a new _X.sts file, per doc (X
 field) in the index.
 And then have FieldSimilarityProvider impls that compute doc's boost
 bytes (norms) from these stats.
 The patch is able to index the stats, merge them when segments are
 merged, and provides an iterator-only API.  It also has starting point
 for per-field Sims that use the stats iterator API to compute boost
 bytes.  But it's not at all tied into actual searching!  There's still
 tons left to do, eg, how does one configure via Field/FieldType which
 stats one wants indexed.
 All tests pass, and I added one new TestStats unit test.
 The stats I record now are:
   - field's boost
   - field's unique term count (a b c a a b -- 3)
   - field's total term count (a b c a a b -- 6)
   - total term count per-term (sum of total term count for all docs
 that have this term)
 Still need at least the total term count for each field.




[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855924#action_12855924
 ] 

Shai Erera commented on LUCENE-2386:


I don't think that people need to write that emptiness-detection-then-commit 
code ... if they care, they can simply call commit() immediately after they 
open IW.

bq. Isn't opening IW with CREATE* mode called specifically asking for?

It depends on how you interpret the mode ... for example, you cannot pass 
OpenMode.APPEND for an empty Directory, because IW throws an exception. The 
modes are just meant to tell IW how to behave:
* APPEND - I know there is an index in the Directory, and I'd like to append to 
it.
* CREATE - I don't care if there is an index in the Directory -- create a new 
one, zeroing out all segments.
* CREATE_OR_APPEND - If there is an index, open it, otherwise create a new one.

So if you pass CREATE on an already populated index, IW doesn't do the implicit 
commit, until you call commit() yourself. But if you pass CREATE on an empty 
index, IW suddenly calls commit()? That's just an inconsistency that's meant to 
allow you to open an IR immediately after the new IW() call, regardless of what 
was there? And if you open that IR, then if the index was populated you see the 
previous set of documents, but if it wasn't you see nothing, even though you 
meant to say override what's there?

I've checked what FileOutputStream does, using the following code:
{code}
File file = new File(d:/temp/tmpfile);
FileOutputStream fos = new FileOutputStream(file);
fos.write(3);
fos.close();
  
fos = new FileOutputStream(file);
FileInputStream fis = new FileInputStream(file);
System.out.println(fis.read());
{code}

* Second line creates an empty file immediately, not waiting for close() or 
flush() -- which resembles the behavior that you're suggesting we should take 
w/ IW (which is the 'today's behavior')
* Fourth line closes the file, flushing and writing the content.
* Fifth line *recreates* the file, empty, again, w/o calling close. So it zeros 
out the file content immediately, even before you wrote a single byte to it.
* Sixth+Seventh lines prove it by attempting to read from the file, and the 
output printed is -1.

I've wrapped the FOS w/ a BufferedOS and the behavior is still the same. What I'm 
trying to show is that we don't fully adhere to the CREATE mode, and rightfully so 
if you ask me - we shouldn't zero out the segments until the application calls 
commit(). But we choose to adhere differently to the CREATE* mode if the index 
is already populated. That's inconsistent behavior, at least from my 
perspective. It's also harder to explain and document, e.g. you should call 
commit() if you used CREATE, in case you want to zero out everything 
immediately, and the Directory is not empty, but you don't need to call 
commit() if the directory was empty, Lucene will do it for you. -- so now how 
will the app know if it should call commit()? It will need to write a sort of 
emptiness-detection-then-commit?

I am willing to consider the following semantics:
* APPEND - assumes an index exists and open it.
* CREATE - zeros out everything that's in the directory *immediately*, and also 
prepares an empty directory.
* CREATE_OR_APPEND - either loads an existing index, or is able to work on the 
empty directory. No implicit commit is happening by IW if the index does not 
exist.

But I think CREATE is too dangerous, and so I prefer to stick w/ the proposed 
change to the patch so far -- if you open an index in CREATE*, you should call 
commit before you can read it. That will adhere to the semantics of what the 
application wanted, whether it meant to zero out an existing Directory, or 
create a new one from scratch.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
  unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856063#action_12856063
 ] 

Shai Erera commented on LUCENE-2386:


So just call new IW(), then rollback and ensure dir.listAll() returns an 
empty list? Or also index stuff, making sure a flush occurs and then rollback? 
I'm not sure that the latter is related to that issue ...

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
  unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).




[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-12 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

Patch includes the proposed test in TestIndexWriter. I think this is ready for 
commit, if there are no more objections.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
  unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).




[jira] Resolved: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2386.


Lucene Fields: [New, Patch Available]  (was: [New])
   Resolution: Fixed

Committed revision 932868.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch






[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855713#action_12855713
 ] 

Shai Erera commented on LUCENE-1709:


Committed revision 932878 with the following:
# benchmark tests force sequential run
# threadsPerProcessor defaults to 1 and can be overridden by 
-DthreadsPerProcessor=value
# A CHANGES entry

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}




Re: svn commit: r932873 - /lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java

2010-04-11 Thread Shai Erera
Sorry about that ...

On Sun, Apr 11, 2010 at 3:10 PM, uschind...@apache.org wrote:

 Author: uschindler
 Date: Sun Apr 11 12:10:57 2010
 New Revision: 932873

 URL: http://svn.apache.org/viewvc?rev=932873view=rev
 Log:
 add missing license header

 Modified:

  
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java

 Modified:
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java
 URL:
 http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java?rev=932873r1=932872r2=932873view=diff

 ==
 ---
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java
 (original)
 +++
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java
 Sun Apr 11 12:10:57 2010
 @@ -1,5 +1,22 @@
  package org.apache.lucene.index;

 +/**
 + * Licensed to the Apache Software Foundation (ASF) under one or more
 + * contributor license agreements.  See the NOTICE file distributed with
 + * this work for additional information regarding copyright ownership.
 + * The ASF licenses this file to You under the Apache License, Version 2.0
 + * (the License); you may not use this file except in compliance with
 + * the License.  You may obtain a copy of the License at
 + *
 + * http://www.apache.org/licenses/LICENSE-2.0
 + *
 + * Unless required by applicable law or agreed to in writing, software
 + * distributed under the License is distributed on an AS IS BASIS,
 + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
 implied.
 + * See the License for the specific language governing permissions and
 + * limitations under the License.
 + */
 +
  import java.io.FileNotFoundException;

  /**





[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855727#action_12855727
 ] 

Shai Erera commented on LUCENE-2386:


Committed revision 932917 for the revert.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch






[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

Fixes IndexFileDeleter, adds a proper test to TestIndexWriter. Haven't run all 
the tests yet though, but the added test passes now with the fix.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch






[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855767#action_12855767
 ] 

Shai Erera commented on LUCENE-2386:


About IndexReader.listCommits ... the javadocs state: "There must be at least 
one commit in the Directory, else this method throws java.io.IOException." So 
I'll change it to reflect that the right exception type 
(IndexNotFoundException) is thrown, and revert the change to 
DirReader.listCommits which returns an empty list.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch






[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

Patch w/ proposed fixes. All tests pass, including Solr's :).

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch, LUCENE-2386.patch






[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-10 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

Patch updated to latest rev. + the proposed name change -- 
IndexNotFoundException. All tests pass. I plan to commit this later today.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch






[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855344#action_12855344
 ] 

Shai Erera commented on LUCENE-2386:


Ok I've added the following to DirReader:

{code}
try {
  latest.read(dir, codecs);
} catch (FileNotFoundException e) {
  if (e.getMessage().startsWith("no segments* file found in")) {
// Might be that the Directory is empty, in which case just return an
// empty collection.
return Collections.emptyList();
  } else {
throw e;
  }
}
{code}

And now that test passes.

I'll continue discovering tests that fail ... probably backwards will have its 
share too :).

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855369#action_12855369
 ] 

Shai Erera commented on LUCENE-2386:


I already did that ... just didn't post back. Created 
SegmentsFileNotFoundException.
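A rough sketch of what such a dedicated exception can look like (names and nesting simplified here; the real class lives in org.apache.lucene.index and, per the later comments on this issue, ended up being named IndexNotFoundException):

```java
import java.io.FileNotFoundException;

public class Main {
  // Sketch: subclassing FileNotFoundException means existing catch blocks
  // for FileNotFoundException keep working unchanged.
  static class IndexNotFoundException extends FileNotFoundException {
    IndexNotFoundException(String message) {
      super(message);
    }
  }

  public static void main(String[] args) {
    Exception e = new IndexNotFoundException("no segments* file found");
    // A typed exception lets callers catch the specific case instead of
    // matching on the exception message text.
    System.out.println(e instanceof FileNotFoundException);
  }
}
```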

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch






[jira] Commented: (LUCENE-1879) Parallel incremental indexing

2010-04-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855379#action_12855379
 ] 

Shai Erera commented on LUCENE-1879:


I have found such a version ... and it fails too :). At least the one I received.

But never mind that ... as long as we both agree the implementation should 
change. I didn't mean to say anything bad about what you did .. I know the 
limitations you had to work with.

 Parallel incremental indexing
 -

 Key: LUCENE-1879
 URL: https://issues.apache.org/jira/browse/LUCENE-1879
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
 Fix For: 3.1

 Attachments: parallel_incremental_indexing.tar


 A new feature that allows building parallel indexes and keeping them in sync 
 on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
 Find details on the wiki page for this feature:
 http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
 Discussion on java-dev:
 http://markmail.org/thread/ql3oxzkob7aqf3jd




[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

Patch fixes all tests, and includes changes to IndexWriter, IndexFileDeleter, 
DirectoryReader and SegmentInfos.

I'd like to commit this shortly, before all the files get changed by a 
malicious other commit :). (kidding of course)

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch






[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855457#action_12855457
 ] 

Shai Erera commented on LUCENE-2386:


Ok sounds good. Is there a preferred package for exceptions? Or is o.a.l.index 
ok?

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch






Move NoDeletionPolicy to core

2010-04-08 Thread Shai Erera
Hi

I've noticed benchmark has a NoDeletionPolicy class and I was wondering if
we can move it to core. I might want to use it for the parallel index stuff,
but I think it'll also fit nicely in core, together with the other No*
classes. In addition, this class should be made a singleton.

If moving to core is acceptable, do you think any bw policy needs to be
enforced (such as deprecating the one in benchmark and referencing the one in
core)? I'll also want to change the package name from o.a.l.benchmark.utils
to o.a.l.index, where the other IDPs are.

A simple move and change (and an update to the benchmark algs which use it).
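For illustration, a singleton no-op policy could look roughly like this (a sketch only; assumption: the real IndexDeletionPolicy callbacks take List<? extends IndexCommit>, simplified here to keep the snippet self-contained):

```java
import java.util.Collections;
import java.util.List;

public class Main {
  // Sketch of a singleton no-op deletion policy. Signatures are simplified;
  // the real onInit/onCommit take List<? extends IndexCommit>.
  static final class NoDeletionPolicy {
    static final NoDeletionPolicy INSTANCE = new NoDeletionPolicy();

    private NoDeletionPolicy() {} // singleton: no external instantiation

    // Never deletes any commit, so every commit point is kept.
    void onInit(List<?> commits) {}
    void onCommit(List<?> commits) {}
  }

  public static void main(String[] args) {
    NoDeletionPolicy p = NoDeletionPolicy.INSTANCE;
    p.onInit(Collections.emptyList());
    p.onCommit(Collections.emptyList());
    System.out.println(p == NoDeletionPolicy.INSTANCE);
  }
}
```

Since the policy holds no state, a single shared instance is enough for any number of writers.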

Shai


[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854885#action_12854885
 ] 

Shai Erera commented on LUCENE-2074:


Uwe, must this be coupled with that issue? This one has been waiting for a long 
time (why? for the JFlex 1.5 release?), and protecting against a huge buffer 
allocation can be a really quick and tiny fix. This one also focuses on getting 
Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is 
not such a critical issue that we need to move fast on it ... so it's your call. 
Just thought they are two unrelated problems.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that, we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.




[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854887#action_12854887
 ] 

Shai Erera commented on LUCENE-2074:


bq. I plan to commit this soon! 

That's great news !

BTW - what are you going to do w/ the JFlex 1.5 binary? Are you going to check 
it in somewhere? Because it hasn't been released, last I checked. I'm asking for 
general knowledge, because I know the scripts download it, or rely on it 
existing somewhere.

In that case, then yes, let's fix it here.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch






[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854920#action_12854920
 ] 

Shai Erera commented on LUCENE-1482:


I still think that calling isDebugEnabled is better, because the message 
formatting stuff may do unnecessary things like casting, autoboxing etc. IMO, 
if logging is enabled, evaluating it twice is not a big deal ... it's a simple 
check.

I'm glad someone here thinks logging will be useful though :). I wish there 
were a quorum here to proceed w/ that.

Note that I also offered to not create any dependency on SLF4J, but rather 
extract infoStream to a static InfoStream class, which will avoid passing it 
around everywhere, and give the flexibility to output stuff from other classes 
which don't have an infoStream at hand.
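A minimal sketch of that idea (all names here are illustrative, not an existing Lucene API):

```java
import java.io.PrintStream;

public class Main {
  // Illustrative sketch of a static InfoStream holder: any class can log
  // without an infoStream reference being threaded through constructors.
  static final class InfoStream {
    private static volatile PrintStream stream; // null == logging disabled

    private InfoStream() {}

    static void setStream(PrintStream s) { stream = s; }

    static boolean isEnabled() { return stream != null; }

    static void message(String component, String msg) {
      PrintStream s = stream; // read the volatile field once
      if (s != null) {
        s.println(component + ": " + msg);
      }
    }
  }

  public static void main(String[] args) {
    InfoStream.message("IW", "dropped");     // disabled: prints nothing
    InfoStream.setStream(System.out);
    InfoStream.message("IW", "now merging"); // printed once enabled
    System.out.println(InfoStream.isEnabled());
  }
}
```

Callers pay only a null check when logging is off, which is the same cheap guard as isDebugEnabled.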

 Replace infoSteram by a logging framework (SLF4J)
 -

 Key: LUCENE-1482
 URL: https://issues.apache.org/jira/browse/LUCENE-1482
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
 slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar


 Lucene makes use of infoStream to output messages in its indexing code only. 
 For debugging purposes, when the search application is run on the customer 
 side, getting messages from other code flows, like search, query parsing, 
 analysis etc can be extremely useful.
 There are two main problems with infoStream today:
 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
 other classes I need to either expose an API or propagate infoStream to all 
 classes (see for example DocumentsWriter, which receives its infoStream 
 instance from IndexWriter).
 2. I can either turn debugging on or off, for the entire code.
 Introducing a logging framework can allow each class to control its logging 
 independently, and more importantly, allows the application to turn on 
 logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
 I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, 
 as it names states, a facade over different logging frameworks. As such, you 
 can include the slf4j.jar in your application, and it recognizes at deploy 
 time what is the actual logging framework you'd like to use. SLF4J comes with 
 several adapters for Java logging, Log4j and others. If you know your 
 application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
 your classpath, and your logging statements will use Java logging underneath 
 the covers.
 This makes the logging code very simple. For a class A the logger will be 
 instantiated like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);
 }
 And will later be used like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);
   public void foo() {
     if (logger.isDebugEnabled()) {
       logger.debug("message");
     }
   }
 }
 That's all !
 Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
 (but I assume it's fast also over other logging frameworks).
 The important thing is, every class controls its own logger. Not all classes 
 have to output logging messages, and we can improve Lucene's logging 
 gradually, w/o changing the API, by adding more logging messages to 
 interesting classes.
 I will submit a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855020#action_12855020
 ] 

Shai Erera commented on LUCENE-1709:


Robert, I will commit the patch, seems good to do anyway. We can handle the ant 
jars separately later.

And this hang behavior is exactly what I experience, including the 
FileInputStream thing. Only on my machine, when I took a thread dump, it showed 
that Ant waits on FIS.read() ...

Robert - to remind you that even with the patch which forces junit to use a 
separate temp folder per thread, it still hung ... 

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)
Move NoDeletionPolicy from benchmark to core


 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1


As the subject says, but I'll also make it a singleton + add some unit tests, 
as well as some documentation. I'll post a patch hopefully today.
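
A rough sketch of the singleton shape this could take (using a simplified 
stand-in for Lucene's IndexDeletionPolicy interface so the snippet is 
self-contained; the real interface takes List&lt;? extends IndexCommit&gt;):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for org.apache.lucene.index.IndexDeletionPolicy,
// only to sketch the proposed singleton no-op shape.
interface DeletionPolicy {
  void onInit(List<String> commits);
  void onCommit(List<String> commits);
}

public final class NoDeletionPolicySketch implements DeletionPolicy {
  public static final NoDeletionPolicySketch INSTANCE = new NoDeletionPolicySketch();

  private NoDeletionPolicySketch() {} // singleton: no public construction

  @Override public void onInit(List<String> commits) {}   // never deletes
  @Override public void onCommit(List<String> commits) {} // never deletes

  public static void main(String[] args) {
    List<String> commits =
        new ArrayList<>(Arrays.asList("segments_1", "segments_2"));
    INSTANCE.onCommit(commits);
    System.out.println(commits.size()); // both commits are kept
  }
}
```

Since the policy keeps every commit, a singleton avoids pointless allocations 
and makes the "this index is never trimmed" intent explicit at the call site.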

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)
IndexWriter commits unnecessarily on fresh Directory


 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
unnecessary, and kind of brings back an autoCommit mode, in a strange way ... 
why do we need that commit? Do we really expect people to open an IndexReader 
on an empty Directory which they just passed to an IW w/ create=true? If they 
want, they can simply call commit() right away on the IW they created.

I ran into this when writing a test which committed N times, then compared the 
number of commits (via IndexReader.listCommits) and was surprised to see N+1 
commits.

Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
jumping on me .. so the change might not be that simple. But I think it's 
manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2385:
---

Attachment: LUCENE-2385.patch

Move NoDeletionPolicy to core, adds javadocs + TestNoDeletionPolicy. Also 
includes the relevant changes to benchmark (algorithms + CreateIndexTask).
I've fixed a typo I had in NoMergeScheduler - not related to this issue, but 
since it was just a typo, thought it's no harm to do it here.

Tests pass. Planning to commit shortly.

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855131#action_12855131
 ] 

Shai Erera commented on LUCENE-2386:


Took a look at IndexFileDeleter, and located the offending code segment which is 
responsible for the CorruptIndexException:
{code}
if (currentCommitPoint == null) {
  // We did not in fact see the segments_N file
  // corresponding to the segmentInfos that was passed
  // in.  Yet, it must exist, because our caller holds
  // the write lock.  This can happen when the directory
  // listing was stale (eg when index accessed via NFS
  // client with stale directory listing cache).  So we
  // try now to explicitly open this commit point:
  SegmentInfos sis = new SegmentInfos();
  try {
sis.read(directory, segmentInfos.getCurrentSegmentFileName(), codecs);
  } catch (IOException e) {
    throw new CorruptIndexException("failed to locate current segments_N file");
  }
{code}

Looks like this code protects against a real problem, which was raised on the 
list a couple of times already - stale NFS cache. So I'm reluctant to remove 
that check ... though I still think we should differentiate between a newly 
created index on a fresh Directory, to a stale NFS problem. Maybe we can pass a 
boolean isNew or something like that to the ctor, and if it's a new index and 
the last commit point is missing, IFD will not throw the exception, but 
silently ignore that? So the code would become something like this:
{code}
if (currentCommitPoint == null && !isNew) {
   
}
{code}

Does this make sense, or am I missing something?

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessarily, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855140#action_12855140
 ] 

Shai Erera commented on LUCENE-2385:


I did that first, but then remembered that when I did that in the past, people 
were unable to apply my patches, w/o doing the svn move themselves. Anyway, for 
this file it's not really important I think - a very simple and tiny file, w/ 
no history to preserve? Is that ok for this file (b/c I have no idea how to do 
the svn move now ... after I've made all the changes already) :)

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855148#action_12855148
 ] 

Shai Erera commented on LUCENE-2386:


Looking at IFD again, I think a boolean ctor arg is not required. What I can do 
is check if any Lucene file has been seen (in the for-loop iteration on the 
Directory files), and if not, then deduce it's a new Directory, and skip that 
'if' check. I'll give it a shot.
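
The detection could amount to something like this sketch (the file-name 
prefix test is an assumption here, standing in for IFD's real matching 
against Lucene's index file names):

```java
import java.util.Arrays;

public class FreshDirectoryCheck {
  // Deduce "fresh directory": no segments_N file was seen in the listing.
  // A simplification of what IndexFileDeleter's scan would check.
  static boolean looksLikeFreshIndex(String[] files) {
    if (files == null) {
      return true;
    }
    return Arrays.stream(files).noneMatch(f -> f.startsWith("segments"));
  }

  public static void main(String[] args) {
    System.out.println(looksLikeFreshIndex(new String[0]));               // true
    System.out.println(looksLikeFreshIndex(new String[] {"segments_1"})); // false
  }
}
```

With that, a missing commit point on a directory with no index files would be 
treated as "new index" rather than a stale-NFS corruption.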

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessarily, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2385:
---

Attachment: LUCENE-2385.patch

Is it better now?

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855155#action_12855155
 ] 

Shai Erera commented on LUCENE-2385:


Forgot to mention that the only move I made was of NoDeletionPolicy:

svn move 
contrib/benchmark/src/java/org/apache/lucene/benchmark/utils/NoDeletionPolicy.java
 src/java/org/apache/lucene/index/NoDeletionPolicy.java

I'll remember that in the future Uwe - thanks for the heads up !

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2385.


Resolution: Fixed

Committed revision 932129.

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

First stab at this. Patch still missing CHANGES entry, and I haven't run all 
the tests, just TestIndexWriter. With those changes it passes. One thing that I 
think should be fixed is testImmediateDiskFull - if I don't add 
writer.commit(), the test fails, because dir.getRecomputedActualSizeInBytes 
returns 0 (no RAMFiles yet), and then the test succeeds at adding one document. 
So maybe just change the test to set maxSizeInBytes to '1', always?

TestNoDeletionPolicy is not covered by this patch (should be fixed as well, 
because now the number of commits is exactly N and not N+1). Will fix it 
tomorrow.

Anyway, it's really late now, so hopefully some fresh eyes will look at it 
while I'm away, and comment on the proposed changes. I hope I got all the 
changes to the tests right.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessarily, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855265#action_12855265
 ] 

Shai Erera commented on LUCENE-2386:


bq. Maybe change testImmediateDiskFull to set max allowed size to max(1, 
current-usage)?

Good idea ! Did it and it works.
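
The fix boils down to clamping the allowed size away from zero; a tiny 
self-contained sketch of that rule (the surrounding MockRAMDirectory 
accessors are not shown):

```java
public class MaxSizeFix {
  // With a fresh directory the current usage is 0; a max size of 0 would let
  // the "disk full" test accidentally succeed at adding the first document,
  // so clamp the limit to at least 1 byte.
  static long maxAllowedSize(long currentUsage) {
    return Math.max(1, currentUsage);
  }

  public static void main(String[] args) {
    System.out.println(maxAllowedSize(0));    // 1
    System.out.println(maxAllowedSize(5000)); // 5000
  }
}
```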

Now ... one thing I haven't mentioned is the bw break. This is a behavioral bw 
break, which specifically I'm not so sure we should care about, because I 
wonder how many apps out there rely on being able to open a reader before they 
ever commited on a fresh new index. So what do you think - do this change 
anyway, OR ... utilize Version to our aid? I.e., if the Version that was passed 
to IWC is before LUCENE_31, we keep the initial commit, otherwise we don't do 
it? Pros is that I won't need to change many of the tests because they still 
use the LUCENE_30 version (but that is not a strong argument), so it's a weak 
Pro. Cons is that IW will keep having that doCommit handling in its ctor, only 
now w/ added comments on why this is being kept around etc.

What do you think?
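
The Version-gating alternative would look roughly like this (a self-contained 
sketch; the real check would live in IW's ctor against the Version passed to 
IWC):

```java
public class VersionGateSketch {
  enum Version {
    LUCENE_30, LUCENE_31;
    boolean onOrAfter(Version other) { return compareTo(other) >= 0; }
  }

  // Keep the legacy initial commit only for apps declaring pre-3.1
  // semantics; 3.1+ apps get no implicit commit on a fresh Directory.
  static boolean shouldDoInitialCommit(Version matchVersion) {
    return !matchVersion.onOrAfter(Version.LUCENE_31);
  }

  public static void main(String[] args) {
    System.out.println(shouldDoInitialCommit(Version.LUCENE_30)); // true
    System.out.println(shouldDoInitialCommit(Version.LUCENE_31)); // false
  }
}
```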

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessarily, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



TestCodecs running time

2010-04-08 Thread Shai Erera
Hi

I've noticed that TestCodecs takes an insanely long time to run on my
machine - between 35-40 seconds. Is that expected?
The reason why it runs so long, seems to be that its threads make (each)
4000 iterations ... is that really required to ensure correctness?

Shai


Re: Controlling the maximum size of a segment during indexing

2010-04-08 Thread Shai Erera
I'm not sure .. but did you set the RAMBufferSizeMB on IWC? Doesn't look
like it, and the default is 16 MB, which can explain why it doesn't flush
before that.

Shai

On Fri, Apr 9, 2010 at 8:01 AM, Lance Norskog goks...@gmail.com wrote:

 Here is a Java unit test that uses the LogByteSizeMergePolicy to
 control the maximum size of segment files during indexing. That is, it
 tries. It does not succeed. Will someone who truly understands the
 merge policy code please examine it. There is probably one tiny
 parameter missing.

 It adds 20 documents that each are 100k in size.

 It creates an index in a RAMDirectory which should have one segment
 that's a tad over 1mb, and then a set of segments that are a tad over
 500k. Instead, the data does not flush until it commits, writing one
 5m segment.


 -
 org.apache.lucene.index.TestIndexWriterMergeMB

 ---

 package org.apache.lucene.index;

 /**
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
  * this work for additional information regarding copyright ownership.
  * The ASF licenses this file to You under the Apache License, Version 2.0
  * (the License); you may not use this file except in compliance with
  * the License.  You may obtain a copy of the License at
  *
  * http://www.apache.org/licenses/LICENSE-2.0
  *
  * Unless required by applicable law or agreed to in writing, software
  * distributed under the License is distributed on an AS IS BASIS,
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */

 import java.io.IOException;

 import org.apache.lucene.analysis.WhitespaceAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.FieldSelectorResult;
 import org.apache.lucene.document.Field.Index;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.RAMDirectory;
 import org.apache.lucene.util.LuceneTestCase;

 /*
  * Verify that segment sizes are limited to # of bytes.
  *
  * Sizing:
  *  Max MB is 0.5m. Verify against this plus 100k slop. (1.2x)
  *  Min MB is 10k.
  *  Each document is 100k.
  *  mergeSegments=2
  *  MaxRAMBuffer=1m. Verify against this plus 200k slop. (1.2x)
  *
  *  This test should cause the ram buffer to flush after 10 documents,
 and create a CFS a little over 1meg.
  *  The later documents should be flushed to disk every 5-6 documents,
 and create CFS files a little over 0.5meg.
  */


 public class TestIndexWriterMergeMB extends LuceneTestCase {
  private static final int MERGE_FACTOR = 2;
  private static final double RAMBUFFER_MB = 1.0;
  static final double MIN_MB = 0.01d;
  static final double MAX_MB = 0.5d;
  static final double SLOP_FACTOR = 1.2d;
  static final double MB = 1000*1000;
  static String VALUE_100k = null;

  // Test controlling the mergePolicy for max # of docs
  public void testMaxMergeMB() throws IOException {
Directory dir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(
TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT));

LogByteSizeMergePolicy mergeMB = new LogByteSizeMergePolicy();
config.setMergePolicy(mergeMB);
mergeMB.setMinMergeMB(MIN_MB);
mergeMB.setMaxMergeMB(MAX_MB);
mergeMB.setUseCompoundFile(true);
mergeMB.setMergeFactor(MERGE_FACTOR);
    config.setMaxBufferedDocs(100); // irrelevant, but the next line fails without this.
config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
MergeScheduler scheduler = new SerialMergeScheduler();
config.setMergeScheduler(scheduler);
IndexWriter writer = new IndexWriter(dir, config);

    System.out.println("Start indexing");
    for (int i = 0; i < 50; i++) {
  addDoc(writer, i);
  printSegmentSizes(dir);
}
checkSegmentSizes(dir);
    System.out.println("Commit");
writer.commit();
printSegmentSizes(dir);
checkSegmentSizes(dir);
writer.close();
  }

  // document that takes 100k of RAM
  private void addDoc(IndexWriter writer, int i) throws IOException {
    if (VALUE_100k == null) {
      StringBuilder value = new StringBuilder(100000);
      for (int fill = 0; fill < 100000; fill++) {
        value.append('a');
      }
      VALUE_100k = value.toString();
    }
Document doc = new Document();
    doc.add(new Field("id", i + "", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", VALUE_100k, Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
  }


  private void checkSegmentSizes(Directory dir) {
try {
  String[] files = dir.listAll();
  for (String file : files) {
if 

[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855277#action_12855277
 ] 

Shai Erera commented on LUCENE-2386:


Apparently, there are more tests that fail ... lost count, but they're easy 
fixes. I tried writing the following test:

{code}
  public void testNoCommits() throws Exception {
    // Tests that if we don't call commit(), the directory has 0 commits.
    // This has changed since LUCENE-2386, where before IW would always
    // commit on a fresh new index.
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT)));
    assertEquals("expected 0 commits!", 0, IndexReader.listCommits(dir).size());
    // No changes still should generate a commit, because it's a new index.
    writer.close();
    assertEquals("expected 1 commits!", 1, IndexReader.listCommits(dir).size());
  }
{code}

Simple test - validates that no commits are present following a freshly new 
index creation, w/o closing or committing. However, IndexReader.listCommits 
fails w/ the following exception:

{code}
java.io.FileNotFoundException: no segments* file found in 
org.apache.lucene.store.ramdirect...@2d262d26: files: []
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:652)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:535)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:323)
at 
org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1033)
at 
org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1023)
at 
org.apache.lucene.index.IndexReader.listCommits(IndexReader.java:1341)
at 
org.apache.lucene.index.TestIndexWriter.testNoCommits(TestIndexWriter.java:4966)
   
{code}

The failure occurs when SegmentInfos attempts to find segments.gen and fails. 
So I wonder if I should fix DirectoryReader to catch that exception and simply 
return an empty Collection .. or I should fix SegmentInfos at this point -- 
notice the files: [] at the end - I think that by adding a check to the 
following code (SegmentInfos, line 652) which validates that there were any 
files before throwing the exception, it'll still work properly and safely (i.e. 
to detect a problematic Directory). Will need probably to break away from the 
while loop and I guess fix some other things in upper layers ... therefore I'm 
not sure if I should not simply catch that exception in 
DirectoryReader.listCommits w/ proper documentation and be done w/ it. After 
all, it's not supposed to be called ... ever? or hardly ever?

{code}
  if (gen == -1) {
    // Neither approach found a generation
    throw new FileNotFoundException("no segments* file found in " +
        directory + ": files: " + Arrays.toString(files));
  }
{code}
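
The catch-and-return-empty alternative for listCommits could be sketched like 
this (simplified: a boolean stands in for the actual segments_N lookup):

```java
import java.io.FileNotFoundException;
import java.util.Collection;
import java.util.Collections;

public class ListCommitsSketch {
  // Stand-in for SegmentInfos reading the latest segments_N file.
  static Collection<String> readCommits(boolean hasSegmentsFile)
      throws FileNotFoundException {
    if (!hasSegmentsFile) {
      throw new FileNotFoundException("no segments* file found");
    }
    return Collections.singletonList("segments_1");
  }

  // The proposed behavior: a fresh index yields an empty commit list
  // instead of propagating the FileNotFoundException.
  static Collection<String> listCommits(boolean hasSegmentsFile) {
    try {
      return readCommits(hasSegmentsFile);
    } catch (FileNotFoundException e) {
      return Collections.emptyList();
    }
  }

  public static void main(String[] args) {
    System.out.println(listCommits(false).size()); // 0
    System.out.println(listCommits(true).size());  // 1
  }
}
```

Catching at the listCommits level keeps SegmentInfos' stale-NFS protection 
intact for every other caller.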

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessarily, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1709) Parallelize Tests

2010-04-07 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1709:
---

Attachment: LUCENE-1709-2.patch

Since I had the changes on my local env., I thought it best to generate a 
patch out of them, so they don't get lost. The patch doesn't cover the ant 
.jars, only the changes to common-build.xml as well as benchmark/build.xml

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}




[jira] Resolved: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark

2010-04-07 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2377.


Resolution: Fixed

Committed revision 931502.

 Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
 -

 Key: LUCENE-2377
 URL: https://issues.apache.org/jira/browse/LUCENE-2377
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2377.patch


 Benchmark allows one to set the MP and MS to use, by defining the class name 
 and then use reflection to instantiate them. However NoMP and NoMS are 
 singletons and therefore reflection does not work for them. Easy fix in 
 CreateIndexTask. I'll post a patch soon.
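The singleton problem can be sketched like this (the class and field names below are made up for illustration; the actual CreateIndexTask fix may differ): reflective instantiation fails on a private constructor, so a fallback reads a static INSTANCE field instead.

```java
// Stand-in for NoMergePolicy-style singletons: private ctor, static instance.
class NoOpPolicy {
    public static final NoOpPolicy INSTANCE = new NoOpPolicy();
    private NoOpPolicy() {}
}

public class SingletonReflect {
    /** Instantiate by reflection, falling back to a static INSTANCE field. */
    static Object instantiate(Class<?> clazz) {
        try {
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            try {
                return clazz.getField("INSTANCE").get(null);  // singleton fallback
            } catch (ReflectiveOperationException e2) {
                throw new IllegalArgumentException(
                    clazz + " has neither a usable constructor nor INSTANCE", e2);
            }
        }
    }
}
```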




[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-04-07 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854588#action_12854588
 ] 

Shai Erera commented on LUCENE-2353:


Actually, we've reopened LUCENE-1709 to track that. This is not related to this 
issue's changes, but seems to be related to the benchmark tests specifically. 
Please have a look there at a patch I've posted which forces benchmark tests to 
run in sequential mode. Additionally, you can run 'ant test -Drunsequential=1' from 
the command line, in benchmark's root folder, to achieve the same.
And it'd be great if you post the above on LUCENE-1709 as well -- because now I 
know I'm not the only one running into this :).

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch, LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects : in the value and so it thinks it's a per-round property, thus 
 stripping d: from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with ":\\"
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.
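For illustration, the before/after behavior can be captured in a self-contained sketch (the method name is made up; the real logic lives in Config.get(String, String)):

```java
// Simplified model of Config's per-round property parsing with the fix:
// a value containing ":\\" is treated as a Windows absolute path rather
// than as a per-round "column:value" property.
public class ConfigPathDemo {
    static String resolve(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval;                 // no colon at all: plain value
        } else if (sval.indexOf(":\\") >= 0) {
            return sval;                 // Windows path such as d:\something
        }
        // per-round property "name:value" -- strip the column name
        return sval.substring(sval.indexOf(":") + 1);
    }
}
```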




Re: Getting fsync out of the loop

2010-04-06 Thread Shai Erera
How often is fsync called? If it's just during calls to commit, then is that
that expensive? I mean, how often do you call commit?

If that's that expensive (do you have some numbers to share) then I think
that'd be a neat idea. Though losing a few minutes' worth of updates may
sometimes be unrecoverable, depending on the scenario, but I guess for those
cases the 'standard way' should be used.

What if your background thread simply committed every couple of minutes?
What's the difference between taking the snapshot (which means you had to
call commit previously) and committing it, to calling iw.commit by a background
merge?

Shai

On Tue, Apr 6, 2010 at 5:11 PM, Earwin Burrfoot ear...@gmail.com wrote:

 So, I want to pump my IndexWriter hard and fast with documents.

 Removing fsync from FSDirectory helps. But for that I pay with possibility
 of
 index corruption, not only if my node suddenly loses
 power/kernelpanics, but also if it
 runs out of disk space (which happens more frequently).

 I invented the following solution:
 We write a special deletion policy that resembles SnapshotDeletionPolicy.
 At all times it takes hold of current synced commit and preserves
 it. Once every N minutes
 a special thread takes latest commit, syncs it and nominates as
 current synced commit. The
 previous one gets deleted.

 Now we are disaster-proof, and do fsync asynchronously from indexing
 threads. We pay for this with
 somewhat bigger transient disc usage, and probably losing a few
 minutes worth of updates in
 case of a crash, but that's acceptable.

 How does this sound?

 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org
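The scheme Earwin proposes can be modeled in a toy sketch (plain version numbers stand in for Lucene's IndexCommit; all names here are invented for illustration): the policy pins the last synced commit so the deleter never removes it, while a background task syncs the newest commit and then moves the pin.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a "keep last synced commit" deletion policy. Commits are just
// version numbers; syncNewest() stands in for fsyncing a commit's files.
public class SyncedCommitPolicy {
    private long lastSynced = -1;   // pinned commit; never deleted

    /** Called on each commit: everything except newest and pinned is deletable. */
    List<Long> onCommit(List<Long> commits) {
        List<Long> deletable = new ArrayList<>();
        long newest = commits.get(commits.size() - 1);
        for (long c : commits) {
            if (c != newest && c != lastSynced) {
                deletable.add(c);
            }
        }
        return deletable;
    }

    /** Background thread: fsync the newest commit, then move the pin to it. */
    void syncNewest(List<Long> commits) {
        long newest = commits.get(commits.size() - 1);
        // ... fsync the newest commit's files here ...
        lastSynced = newest;  // the previously pinned commit becomes deletable
    }

    long lastSynced() { return lastSynced; }
}
```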




Re: Getting fsync out of the loop

2010-04-06 Thread Shai Erera
Earwin - do you have some numbers to share on the running time of the
indexing application? You've mentioned that if you take out fsync into a BG
thread, the running time improves, but I'm curious to know by how much.

Shai

On Wed, Apr 7, 2010 at 2:26 AM, Earwin Burrfoot ear...@gmail.com wrote:

  Running out of disk space with fsync disabled won't lead to corruption.
  Even kill -9 the JRE process with fsync disabled won't corrupt.
  In these cases index just falls back to last successful commit.
 
  It's only power loss / OS / machine crash where you need fsync to
  avoid possible corruption (corruption may not even occur w/o fsync if
  you get lucky).

 Sorry to disappoint you, but running out of disk space is worse than kill
 -9.
 You can write down the file (to cache in fact), close it, all without
 getting any
 exceptions. And then it won't get flushed to disk because the disk is full.
 This can happen to segments file (and old one is deleted with default
 deletion
 policy). This can happen to fat freq/prox files mentioned in segments file
 (and yeah, the old segments file is deleted, so no falling back).

  What if your background thread simply committed every couple of minutes?
  What's the difference between taking the snapshot (which means you had
  to call commit previously) and commit it, to call iw.commit by a
 backgroud merge?
 --
  But: why do you need to commit so often?
 To see stuff on reopen? Yes, I know about NRT.

  You've reinvented autocommit=true!
 ?? I'm doing regular commits, syncing down every Nth of it.

  Doesn't this just BG the syncing?  Ie you could make a dedicated
  thread to do this.
 Yes, exactly, this BGs the syncing to a dedicated thread. Threads
 doing indexing/merging can continue unhampered.

  One possible win with this aproach is the cost of fsync should go
  way down the longer you wait after writing bytes to the file and
  before calling fsync.  This is because typically OS write caches
  expire by time (e.g. 30 seconds), so if you wait long enough the bytes
  will already at least be delivered to the IO system (but the IO system
  can do further caching which could still take time).  On windows at
  least I definitely noticed this effect -- wait some before fsync'ing
  and it's net/net much less costly.
 Yup. In fact you can just hold on to the latest commit for N seconds,
 than switch to the new latest commit.
 OS will fsync everything for you.


 I'm just playing around with stupid idea. I'd like to have NRT
 look-alike without binding readers and writers. :)
 Right now it's probably best for me to save my time and cut over to current
 NRT.
 But. An important lesson was learnt - no fsyncing blows up your index
 on out-of-disk-space.

 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854348#action_12854348
 ] 

Shai Erera commented on LUCENE-1709:


One more thing - change benchmark tests to run sequentially (by adding the 
property).
Robert, are you going to tackle that soon?

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}




[jira] Created: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark

2010-04-06 Thread Shai Erera (JIRA)
Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
-

 Key: LUCENE-2377
 URL: https://issues.apache.org/jira/browse/LUCENE-2377
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1


Benchmark allows one to set the MP and MS to use, by defining the class name 
and then use reflection to instantiate them. However NoMP and NoMS are 
singletons and therefore reflection does not work for them. Easy fix in 
CreateIndexTask. I'll post a patch soon.




[jira] Updated: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark

2010-04-06 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2377:
---

Attachment: LUCENE-2377.patch

Patch includes both the fix to CreateIndexTask and relevant tests in 
CreateIndexTaskTest. I plan to commit later today if there are no objections.

 Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
 -

 Key: LUCENE-2377
 URL: https://issues.apache.org/jira/browse/LUCENE-2377
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2377.patch


 Benchmark allows one to set the MP and MS to use, by defining the class name 
 and then use reflection to instantiate them. However NoMP and NoMS are 
 singletons and therefore reflection does not work for them. Easy fix in 
 CreateIndexTask. I'll post a patch soon.




Re: Parallel tests in Benchmark

2010-04-03 Thread Shai Erera
Ok let's do that (add runsequential to benchmark and all the rest). If
I run into this elsewhere as well I'll report it and we can talk
then about trying to find a solution for this. If it's just benchmark
then I think we'll be ok.

Shai

On Thursday, April 1, 2010, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 1, 2010 at 12:03 AM, Shai Erera ser...@gmail.com wrote:


 Hi

 I'd like to summarize a discussion I had w/ Robert and Mike last night on 
 IRC, about the parallelism of tasks in Benchmark:

 For some reason, ever since parallel tasks were introduced, when I run 'ant 
 test' from the contrib/benchmark folder (or the root), the tests just hang at 
 some point, after WriteLineDocTaskTest finishes. What's very weird is that it 
 seems I'm the only one experiencing this, and so for a long time I thought 
 it's just a problem w/ my environment ... until yesterday when I did a fresh 
 checkout of trunk, to a fresh folder and project, and still the tests stuck.

 Thread dump does not show anything relevant to Lucene code, but rather to 
 Ant. The main thread is waiting on 
 org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on 
 org/apache/tools/ant/taskdefs/Execute.waitFor and two other on 
 java/io/FileInputStream.read. But nothing is related to Lucene code, 
 directly. Also annoyingly, but conveniently for debugging that issue, it 
 happens very consistently on my machine - sometimes the test passes, but 90% 
 hangs.
 Running w/ -Drunsequential=1 consistently succeeds.

 We've explored different ways to understand the cause of the problem, and 
 came across several improvements and a workaround, but unfortunately not to a 
 definite resolution:

 * As a last resort, we can add runsequential property to benchmark build.xml, 
 which forces Benchmark tests to run sequentially. Since that's a tiny package 
 which takes a few seconds to run anyway, and parallelism doesn't improve much 
 (it actually runs slower, when it passes, on my machine: parallel=15 sec, 
 seq=11 sec), this might be acceptable.

 * Moving the junit temp files (such as that flag file) created to the temp 
 directory each test uses. This is actually a good thing to do anyway (thanks 
 Robert for spotting that), because it avoids accidental commits of such files 
 :), as well as doesn't clutter the main environment. We've done that because 
 when I hit CTRL+C to stop one of the runs which hung, we received an FNFE saying a 
 junit flag file was being accessed by another process (something like that), 
 and thought this was related to the hangs I'm seeing. Anyway, this file is 
 accessed by multiple JVMs concurrently, which seems bad.

 * Explore the JUnit Formatter code under src/test, since it uses file 
 locking. I've disabled locks (using NoLockFactory), however the test still 
 hung.

 * Change common-build.xml threadsPerProcessor to '1' instead of '2'. We think 
 that might be a good thing to do anyway - if people run on machines with just 
 one CPU, threading is not expected to help much, as opposed to running on 
 multiple CPUs. But we don't want to enforce it on anyone, so we think to 
 change the default to '1', but introduce a property 'threadsPerProcessor' 
 which users will be able to set explicitly.
 ** Surprisingly, when I set it to '1' or '10' (I run on dual-core Thinkpad 
 W500), the test consistently passes - it just doesn't like the value '2'. At 
 least it passed as long as I ran it, maybe a thread hang is lurking for me 
 around the corner somewhere.

 * We made sure the benchmark tests indeed read/write the test data files 
 from/to unique directories. But like I said - there is no hang in Lucene code 
 reported in the thread dump.

 It was very late last night when we stopped, and my eyes were tired, so I 
 didn't summarize it right away. Robert, I hope I've captured everything we 
 did, if not please add.

 Anyone's got any suggestions? It's unfortunate that I'm the only one running 
 into this problem, because whatever the suggestions are, you'll probably need 
 me to confirm them :). And I'm going away for 3 days (camping - no internet 
 ... well at least no laptop :)), so unless someone has a suggestion within 
 the coming few hours, we can continue that when I get back.

 Shai


 I think you got everything. I reopened the JIRA issue too (LUCENE-1709) and 
 listed the things we can do for sure now, such as lowering 
 threadsPerProcessor (and allowing someone to use a system property to 
 override this) and fixing junit temp files to be in the temp directory. 
 Additionally I would like to fix the ant library problem as mentioned there. 
 it works great from the command-line but we should improve this for 
 IDE-users, so they do not see a compile error.

 I am personally for the idea of adding the runsequential property to 
 benchmark's build.xml, to force it to run serially. While I am unable to 
 reproduce your problem, it does not surprise me, as I had a tough time trying 
 to prevent benchmark

Re: Landing the flex branch

2010-04-03 Thread Shai Erera
bq. Try a merge back: This would let flex appear as a single commit to
trunk, so the history of trunk would be preserved.

 +1 for that - I think the history of trunk is important to preserve.
And there is also a way to ask for flex's history, so everybody wins?

Shai

On Thursday, April 1, 2010, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 we should think about how to merge the changes to trunk. I can try this out 
 during the weekend, to merge back the changes to trunk, but this can be very 
 hard. So we have the following options:

 Try a merge back: This would let flex appear as a single commit to trunk, so 
 the history of trunk would be preserved. If somebody wants to see the changes 
 in the flex branch, he could ask for them (e.g. in TortoiseSVN there is a 
 checkbox Include merged revisions). If this is not easy or fails, we can do 
 the following:

 - Create a big diff between current trunk and flex (after flex is merged up 
 to trunk). Attach this patch to an issue and let everybody review. After that 
 we can apply the patch to trunk. This would result in the same behavior for 
 trunk, no changes lost, but all changes in flex cannot be reviewed.
 - Delete current trunk and svn move the branch to trunk (after flex is merged 
 up to trunk): This would make the history of flex the current history. The 
 drawback: You lose the latest trunk changes since the split of flex. Instead you 
 will only see the merge messages. Therefore we should see this only as a last 
 resort.

 Comments?

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, March 30, 2010 5:35 PM
 To: java-dev@lucene.apache.org
 Subject: Landing the flex branch

 I think the time has finally come!  Pending one issue (LUCENE-2354 --
  Uwe), I think flex is ready to land. I think the other issues with
 Fix
 Version = Flex Branch can be moved to 3.1 after we land.

 We still use the pre-flex APIs in a number of places... I think this
 is actually good (so we continue to test the back-compat emulation
 layer).  With time we can cut them over.

 After flex, there are a number of fun things to explore.  EG, we need
 to make attributes work well with codecs  indexing/searching (with
  Multi/DirReader, serialize/unserialize, etc.); we need a BytesRef +
 packed ints FieldCache StringIndex variant which should use much less
 RAM in certain cases; we should build a fast core PForDelta codec;
 more queries can cutover to operating directly on byte[] terms, etc.
 But these can all come with time...

 Thoughts/issues/objections?

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org






Re: Welcome Uwe Schindler to the Lucene PMC

2010-04-01 Thread Shai Erera
Congratulations Uwe !

Shai

On Thursday, April 1, 2010, Earwin Burrfoot ear...@gmail.com wrote:
 Generics SpecOps made it to the top and are gonna rule us from the
 shadows :)  Congrats!

 On Thu, Apr 1, 2010 at 16:37, Robert Muir rcm...@gmail.com wrote:
 Congrats Uwe!

 On Thu, Apr 1, 2010 at 7:05 AM, Grant Ingersoll gsing...@apache.org wrote:

 I'm pleased to announce that the Lucene PMC has voted to add Uwe Schindler
 to the PMC.  Uwe has been doing a lot of work in Lucene and Solr, including
 several of the last releases in Lucene.

 Please join me in extending congratulations to Uwe!

 -Grant Ingersoll
 PMC Chair
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




 --
 Robert Muir
 rcm...@gmail.com




 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org






[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-31 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851829#action_12851829
 ] 

Shai Erera commented on LUCENE-2310:


+1 for this simplification. Can we just name it Indexable, and omit Document 
from it? That way, it's both shorter and there's less chance for users to directly 
link it w/ Document.

One thing I didn't understand though, is what will happen to the ir/is.doc() 
method? Will those be deprecated in favor of some other class which receives an 
IR as parameter and knows how to re-construct Indexable(Document)?

 Reduce Fieldable, AbstractField and Field complexity
 

 Key: LUCENE-2310
 URL: https://issues.apache.org/jira/browse/LUCENE-2310
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Index
Reporter: Chris Male
 Attachments: LUCENE-2310-Deprecate-AbstractField-CleanField.patch, 
 LUCENE-2310-Deprecate-AbstractField.patch, 
 LUCENE-2310-Deprecate-AbstractField.patch, 
 LUCENE-2310-Deprecate-AbstractField.patch, 
 LUCENE-2310-Deprecate-DocumentGetFields-core.patch, 
 LUCENE-2310-Deprecate-DocumentGetFields.patch, 
 LUCENE-2310-Deprecate-DocumentGetFields.patch


 In order to move field type like functionality into its own class, we really 
 need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
 Currently AbstractField depends on Field, and does not provide much more 
 functionality than storing fields, most of which is being moved over to 
 FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
 possibly Fieldable), moving much of the functionality into Field and 
 FieldType.




[jira] Assigned: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-31 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera reassigned LUCENE-2353:
--

Assignee: Shai Erera

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch, LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects : in the value and so it thinks it's a per-round property, thus 
 stripping d: from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with ":\\"
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.




[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-31 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851836#action_12851836
 ] 

Shai Erera commented on LUCENE-2353:


Unless there are objections, I plan to commit this shortly

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch, LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects : in the value and so it thinks it's a per-round property, thus 
 stripping d: from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with ":\\"
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-31 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851842#action_12851842
 ] 

Shai Erera commented on LUCENE-2310:


Right Earwin - agreed.

I'd like to summarize a brief discussion we had on IRC around that:
The idea is not to provide another interface/class for search purposes, but 
rather expose the right API from IndexReader, even if it might be a bit 
low-level. API like getIndexedFields(docId) and getStorefFields(docId), both 
optionally take a FieldSelector, should allow the application to re-construct 
its Indexable however it wants. And IR/IS don't need to know anything about 
that.
To complete the picture for current users, we can have a static reconstruct() 
on Document which takes IR, docId and FieldSelector ...

BTW, I'm not even sure getIndexedFields can be efficiently supported today. 
Just listing it here for completeness.
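To make the proposed shape concrete, here is a toy sketch of the kind of API being discussed; every name below is hypothetical, nothing is actual Lucene code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy sketch of the API shape under discussion - every name here is
// hypothetical, nothing is actual Lucene code.
public class ReconstructSketch {

    interface FieldSelector {                  // mirrors the idea of a field filter
        boolean accept(String fieldName);
    }

    interface DocFieldsSource {                // stand-in for the proposed IR additions
        List<String> getStoredFields(int docId, FieldSelector sel);
    }

    // The application rebuilds whatever "document" representation it wants
    // from the low-level per-doc field access.
    static List<String> reconstruct(DocFieldsSource src, int docId, FieldSelector sel) {
        return src.getStoredFields(docId, sel);
    }

    public static void main(String[] args) {
        // Dummy source: every doc "has" the same two stored fields.
        DocFieldsSource src = (docId, sel) ->
                Arrays.asList("title", "body").stream()
                      .filter(sel::accept)
                      .collect(Collectors.toList());
        System.out.println(reconstruct(src, 0, name -> name.equals("title")));
    }
}
```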

 Reduce Fieldable, AbstractField and Field complexity
 

 Key: LUCENE-2310
 URL: https://issues.apache.org/jira/browse/LUCENE-2310
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Index
Reporter: Chris Male
 Attachments: LUCENE-2310-Deprecate-AbstractField-CleanField.patch, 
 LUCENE-2310-Deprecate-AbstractField.patch, 
 LUCENE-2310-Deprecate-AbstractField.patch, 
 LUCENE-2310-Deprecate-AbstractField.patch, 
 LUCENE-2310-Deprecate-DocumentGetFields-core.patch, 
 LUCENE-2310-Deprecate-DocumentGetFields.patch, 
 LUCENE-2310-Deprecate-DocumentGetFields.patch


 In order to move field type like functionality into its own class, we really 
 need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
 Currently AbstractField depends on Field, and does not provide much more 
 functionality that storing fields, most of which are being moved over to 
 FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
 possible Fieldable), moving much of the functionality into Field and 
 FieldType.




[jira] Resolved: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-31 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2353.


Resolution: Fixed

Committed revision 929520.

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch, LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects ":" in the value and so it thinks it's a per-round property, thus 
 stripping "d:" from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with \\
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.




Parallel tests in Benchmark

2010-03-31 Thread Shai Erera
Hi

I'd like to summarize a discussion I had w/ Robert and Mike last night on
IRC, about the parallelism of tasks in Benchmark:

For some reason, ever since parallel tasks were introduced, when I run 'ant
test' from the contrib/benchmark folder (or the root), the tests just hang
at some point, after WriteLineDocTaskTest finishes. What's very weird is
that it seems I'm the only one experiencing this, and so for a long time I
thought it's just a problem w/ my environment ... until yesterday when I did
a fresh checkout of trunk, to a fresh folder and project, and still the
tests stuck.

Thread dump does not show anything relevant to Lucene code, but rather to
Ant. The main thread is waiting on
org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on
org/apache/tools/ant/taskdefs/Execute.waitFor and two other on
java/io/FileInputStream.read. But nothing is related to Lucene code,
directly. Also annoyingly, but conveniently for debugging that issue, it
happens very consistently on my machine - sometimes the test passes, but 90%
of the time it hangs.
Running w/ -Drunsequential=1 consistently succeeds.

We've explored different ways to understand the cause of the problem, and
came across several improvements and a workaround, but unfortunately not to
a definite resolution:

* As a last resort, we can add runsequential property to benchmark
build.xml, which forces Benchmark tests to run sequentially. Since that's a
tiny package which takes a few seconds to run anyway, and parallelism
doesn't improve much (it actually runs slower, when it passes, on my
machine: parallel=15 sec, seq=11 sec), this might be acceptable.

* Moving the junit temp files (such as that flag file) to the temp
directory each test uses. This is actually a good thing to do anyway (thanks
Robert for spotting that), because it avoids accidental commits of such
files :), as well as doesn't clutter the main environment. We've done that
because when I hit Ctrl+C to stop one of the runs which hung, we received an
FNFE saying a junit flag file is being accessed by another process (something
like that), and thought this is related to the hangs I'm seeing. Anyway,
this file is accessed by multiple JVMs concurrently, which seems
bad.

* Explore the JUnit Formatter code under src/test, since it uses file
locking. I've disabled locks (using NoLockFactory), however the test still
hung.

* Change common-build.xml threadsPerProcessor to '1' instead of '2'. We
think that might be a good thing to do anyway - if people run on machines
with just one CPU, threading is not expected to help much, as opposed to
running on multiple CPUs. But we don't want to enforce it on anyone, so we
think to change the default to '1', but introduce a property
'threadsPerProcessor' which users will be able to set explicitly.
** Surprisingly, when I set it to '1' or '10' (I run on dual-core Thinkpad
W500), the test consistently passes - it just doesn't like the value '2'. At
least it passed as long as I ran it, maybe a thread hang is lurking for me
around the corner somewhere.

* We made sure the benchmark tests indeed read/write the test data files
from/to unique directories. But like I said - there is no hang in Lucene
code reported in the thread dump.

It was very late last night when we stopped, and my eyes were tired, so I
didn't summarize it right away. Robert, I hope I've captured everything we
did, if not please add.

Anyone's got any suggestions? It's unfortunate that I'm the only one running
into this problem, because whatever the suggestions are, you'll probably
need me to confirm them :). And I'm going away for 3 days (camping - no
internet ... well at least no laptop :)), so unless someone has a suggestion
within the coming few hours, we can continue that when I get back.

Shai


[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-29 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2353:
---

Attachment: LUCENE-2353.patch

Updated to also match 'c:/temp'-style paths, which are also accepted on Windows.

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch, LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects ":" in the value and so it thinks it's a per-round property, thus 
 stripping "d:" from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with \\
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.




[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-28 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850644#action_12850644
 ] 

Shai Erera commented on LUCENE-2353:


I don't have an account yet, so I cannot commit this on my own. Any volunteers?

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects ":" in the value and so it thinks it's a per-round property, thus 
 stripping "d:" from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with \\
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.




[jira] Created: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-27 Thread Shai Erera (JIRA)
Config incorrectly handles Windows absolute pathnames
-

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 3.1


I have no idea how no one ran into this so far, but I tried to execute an .alg 
file which used ReutersContentSource and referenced both docs.dir and work.dir 
as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run 
reported an error of missing content under benchmark\work\something.

I've traced the problem back to Config, where get(String, String) includes the 
following code:
{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}

It detects ":" in the value and so it thinks it's a per-round property, thus 
stripping "d:" from the value ... fix is very simple:
{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with \\
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}

I'll post a patch w/ the above fix + test shortly.




[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames

2010-03-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2353:
---

Attachment: LUCENE-2353.patch

The fix is only relevant to get(String, String) and not to all other 
get(String, type) variants.

Benchmark test passed but after I svn up (to include the latest parallel test 
thing) the test just sits idle (after finishing), waiting for something. If I 
run the tests in eclipse they pass. So I'm guessing it's a problem w/ my env. 
or build.xml?

I also tried 'ant clean test' from within benchmark, but it didn't help. I then 
tried 'ant clean' from root, and 'ant test' from benchmark, but the test just 
keeps waiting on WriteLineDocTaskTest, on this line:
[junit]  config properties:
[junit] directory = RAMDirectory
[junit] doc.maker = 
org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = 
D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line
[junit] ---

I think this can go in (if it passes on someone else's machine), while I figure 
out what's wrong in my env. separately.

 Config incorrectly handles Windows absolute pathnames
 -

 Key: LUCENE-2353
 URL: https://issues.apache.org/jira/browse/LUCENE-2353
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2353.patch


 I have no idea how no one ran into this so far, but I tried to execute an 
 .alg file which used ReutersContentSource and referenced both docs.dir and 
 work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the 
 run reported an error of missing content under benchmark\work\something.
 I've traced the problem back to Config, where get(String, String) includes 
 the following code:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 ...
 {code}
 It detects ":" in the value and so it thinks it's a per-round property, thus 
 stripping "d:" from the value ... fix is very simple:
 {code}
 if (sval.indexOf(":") < 0) {
   return sval;
 } else if (sval.indexOf(":\\") >= 0) {
   // this previously messed up absolute path names on Windows. Assuming
   // there is no real value that starts with \\
   return sval;
 }
 // first time this prop is extracted by round
 int k = sval.indexOf(":");
 String colName = sval.substring(0, k);
 sval = sval.substring(k + 1);
 {code}
 I'll post a patch w/ the above fix + test shortly.




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850075#action_12850075
 ] 

Shai Erera commented on LUCENE-2345:


Earwin, w/o knowing too much about the details of your work, I wanted to 
comment on "get rid of init/reinit/moreinit methods, moving the code to 
constructors". I'm now working on Parallel Index, and one of the things I do is 
extend IW. Currently, IW's ctor code performs the initialization, however I'm 
thinking to move that code to an init method. The reason is to allow easy 
extensions of IW, such as LUCENE-2330. There I'm going to add a default ctor to 
IW, accompanied by an init method the extending class can call if needed. So 
what I'm trying to say is that init methods are not always bad, and sometimes 
ctors limit you. Perhaps it would make sense though in what you're trying to do 
...
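A tiny sketch of the pattern I mean (class names hypothetical, not actual IndexWriter code): a protected no-arg ctor plus an init() lets the subclass do its own setup before the shared initialization runs:

```java
// Tiny sketch of the ctor-vs-init() trade-off (class names hypothetical,
// not actual IndexWriter code).
public class InitSketch {

    static class BaseWriter {
        protected String dir;

        protected BaseWriter() { }            // subclass decides when to init
        BaseWriter(String dir) { init(dir); } // normal path: ctor initializes

        protected final void init(String dir) {
            this.dir = dir;                   // the shared initialization
        }
    }

    static class ExtendedWriter extends BaseWriter {
        ExtendedWriter(String dir) {
            super();                          // skip base init for now
            // ... subclass-specific setup that must happen first ...
            init(dir);                        // then run the shared initialization
        }
    }

    public static void main(String[] args) {
        System.out.println(new ExtendedWriter("/tmp/index").dir); // /tmp/index
    }
}
```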

 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
     return get(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future
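As a rough illustration of what such a factory hook enables (the Reader classes below are toy stand-ins, since SegmentReader's real constructors aren't public - all names are assumptions):

```java
// Toy illustration of the proposed factory hook - the Reader classes below
// are stand-ins, not real Lucene types.
public class FactorySketch {

    static class Reader {
        final boolean readOnly;
        Reader(boolean readOnly) { this.readOnly = readOnly; }
    }

    static class ReaderFactory {
        Reader get(boolean readOnly) { return new Reader(readOnly); }
        Reader reopen(Reader old, boolean readOnly) { return get(readOnly); }
    }

    public static void main(String[] args) {
        final StringBuilder log = new StringBuilder();
        // A subclass can capture open events or attach per-reader state:
        ReaderFactory f = new ReaderFactory() {
            @Override
            Reader get(boolean readOnly) {
                log.append("opened readOnly=").append(readOnly).append('\n');
                return super.get(readOnly);
            }
        };
        f.reopen(f.get(true), false);   // reopen dispatches back through get()
        System.out.print(log);          // one "opened" event per get() call
    }
}
```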




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850083#action_12850083
 ] 

Shai Erera commented on LUCENE-2345:


Thanks Uwe, I know that ctor is the preferred way; in the process of 
introducing IWC I deleted IW.init, which all ctors called, and pulled all the code 
into IW's ctor. I will make that init() on IW final. But sometimes putting code in 
init() is not bad (it's used elsewhere in Lucene too, e.g. PQ and, until 
recently, IW).

 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
     return get(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2215) paging collector

2010-03-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850086#action_12850086
 ] 

Shai Erera commented on LUCENE-2215:


Sure let's wait for the patch and some perf. results.

 paging collector
 

 Key: LUCENE-2215
 URL: https://issues.apache.org/jira/browse/LUCENE-2215
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.4, 3.0
Reporter: Adam Heinz
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: IterablePaging.java, LUCENE-2215.patch, 
 PagingCollector.java, TestingPagingCollector.java


 http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
 Somebody assign this to Aaron McCurry and we'll see if we can get enough 
 votes on this issue to convince him to upload his patch.  :)



