Re: TestCodecs running time
See you already did that, Mike :). Thanks! Now the tests run for 2s. Shai On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is effective. Mike On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote: Hi, I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35 and 40 seconds. Is that expected? The reason it runs so long seems to be that its threads each make 4000 iterations ... is that really required to ensure correctness? Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
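Mike's suggestion -- fewer iterations, but no fixed seed to newRandom(), so repeated/distributed runs cover different cases -- can be sketched roughly like this. This is a minimal illustration with made-up names and constants (ITERATIONS, randomDocFreqs, the 1..20 range); it is not TestCodecs' actual code:

```java
import java.util.Random;

public class RandomizedIterations {
    // Reduced per-run iteration count; coverage accumulates across runs
    // because each run draws a fresh, unfixed seed.
    static final int ITERATIONS = 100;

    // Example of per-iteration random test data (doc freqs in 1..20).
    public static int[] randomDocFreqs(Random random, int n) {
        int[] freqs = new int[n];
        for (int i = 0; i < n; i++) {
            freqs[i] = 1 + random.nextInt(20);
        }
        return freqs;
    }
}
```

The point is that the `new Random()` lives in the test harness, not in the data generator, so no run pins itself to one seed.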
SnapshotDeletionPolicy throws NPE if no commit happened
SDP throws NPE if the index includes no commits but snapshot() is called. This is an extreme case, but it can happen if one takes snapshots (for backup purposes, for example) in a separate code path from indexing, and does not know whether commit was ever called. I think we should throw an IllegalStateException instead of failing with an NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... but I prefer the ISE. What do you think? Shai
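The proposed guard could look roughly like the following sketch. The class and members here are illustrative stand-ins, not Lucene's actual SnapshotDeletionPolicy internals:

```java
public class SnapshotPolicySketch {
    private Object lastCommit; // the real policy tracks IndexCommit objects

    // Called by the indexing side once a commit exists.
    public void onCommit(Object commit) {
        lastCommit = commit;
    }

    public Object snapshot() {
        if (lastCommit == null) {
            // Proposed behavior: a descriptive ISE instead of an NPE.
            throw new IllegalStateException(
                "No index commit to snapshot; call IndexWriter.commit() at least once first");
        }
        return lastCommit;
    }
}
```

The alternative discussed (returning null) would push the null check onto every caller, which is why the explicit exception reads as the safer API.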
Re: SnapshotDeletionPolicy throws NPE if no commit happened
Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to hit. I was just asking whether we want to add explicit and clear protective code for it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
Well ... I think that version numbers mean more than we'd like them to mean, as people perceive them. Let's discuss the format X.Y.Z: When X is changed, it should mean something 'big' happened - index structure has changed (e.g. the flexible scoring work), a new Java version is supported (Java 1.6), and even stuff like 'flex', which includes statements like if you don't want your app to slow down, consider reindexing. Such things signal a major change in Lucene, sometimes even just policy changes (Java version supported), and therefore I think we should reserve the ability to bump X when such things happen. Another thing is the index structure back-compat policy - today Lucene supports the X-1 index structure, but during upgrades of X.Y versions, your segments are gradually migrated. Eventually, when you upgrade to 4.0 you should know whether you have a 2.x index, and call optimize just in case if you're not sure it's not migrated yet (if you've upgraded to 3.x). If we start bumping up 'X' too often, we'll either need to change the X-1 policy to X-N, which will just complicate matters for users, or we'll keep the X-1 policy, but people will need to call optimize more frequently. Y should change on a regular basis, and no back-compat, API-wise or index runtime-wise, is guaranteed. So the Collector and per-segment searches in 2.9 could go w/o deprecating tons of API, as could the TokenStream work. Changes to Analyzer's runtime capabilities will also be allowed between Y revisions. Z should change when bugs are fixed, or when features are backported. Really ... we rarely fix bugs on a released Y branch, and I don't expect tons of features will be backported to a Y branch (to create a Z+1 release). Therefore this should not confuse anyone. So all I'm saying is that instead of increasing X whenever the API, index structure or runtime behavior has changed, I'm simply proposing to differentiate between really major changes and those that just say 'we're not back-compat compliant'.
But above all, I'd like to see this change happening, so if I need to surrender to the X vs. X+Y approach, I will. Just think it will create some confusion. BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great. Shai On Thu, Apr 15, 2010 at 10:24 AM, Earwin Burrfoot ear...@gmail.com wrote: We can remove Version, because all incompatible changes go straight to a new major release, which we release more often, yes. 3.x is going to be released after 4.0 if bugs are found and fixed, or if people ask to backport some (minor?) features, and some dev has time for this. The question of what to call major release in X.Y.Z scheme - X or Y, is there, but immaterial :) I think it's okay to settle with X.Y, we have major releases and bugfixes, what that third number can be used for? On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote: So then I don't understand this: {quote} * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release right? 
Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out, unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor as X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote: Ahh ... a dream finally comes true ... what
Re: SnapshotDeletionPolicy throws NPE if no commit happened
BTW, even if it's a stupid thing to do, someone can today create SDP and call snapshot without ever creating IW. And it's not an impossible scenario. Consider a backup-aware application which creates SDP first, then passes it to the indexing process and the backup process, separately. The backup process doesn't need to know of IW at all, and might call snapshot() before IW was even created, and SDP.onInit was called. It's a possibility, not saying it's a great and safe architecture. So this is really about do we want to write clear protective code, or allow the NPE? Shai 2010/4/15 Shai Erera ser...@gmail.com Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears.
Re: Proposal about Version API relaxation
Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. Up until now, Lucene migrated my segments gradually, and before I upgraded from X+1 to X+2 I could run optimize() to ensure my index will be readable by X+2. I don't think I can myself agree to it, let alone convince all the stakeholders in my company who adopt Lucene today in numerous projects, to let go of such a capability. We've been there before (requiring reindexing on version upgrades) w/ some offerings, and customers simply didn't like it and were forced to use an enterprise-class search engine which offered less (and didn't use Lucene, up until recently!) ... until we moved to Lucene. What's Solr's take on it? I differentiate between structural changes and runtime changes. I, myself, don't mind if we let go of back-compat support for runtime changes, such as those generated by analyzers, for a couple of reasons, the most important ones being (1) these are not so frequent (but neither are index structural changes) and (2) that's a decision I, as the application developer, make - using or not a newer version of an Analyzer. I don't mind working hard to make a 2.x Analyzer version work in the 3.x world, but I cannot make a 2.x index readable by a 3.x Lucene jar if the latter doesn't support it. That's the key difference, in my mind, between the two. I can choose not to upgrade at all to a newer analyzer version ... but I don't want to be forced to stay w/ older Lucene versions and features because of that ... well, people might say that it's not Lucene's problem, but I beg to differ.
Lucene benefits from wider and faster adoption, and we rely on new features to be adopted quickly. That might be jeopardized if we let go of that strong capability, IMO. What we can do is provide an index migration tool ... but personally I don't know what's the difference between that and gradually migrating segments as they are merged, code-wise. I mean - it has to be the same code. Only an index migration tool may take days to complete on a very large index, while the ongoing migration takes ~0 time when you come to upgrade to a newer Lucene release. And the note about Terrier requiring reindexing ... well, I can't say it's a strength of it, but a damn big weakness IMO. About the release pace, I don't think we can suddenly release every 2 years ... makes people think the project is stuck. And some out there are not so fond of using a 'trunk' version and releasing it w/ their products, because trunk is perceived as ongoing development (which it is) and thus less stable, or is likely to change, and most importantly harder to maintain (as the consumer). So I still think we should release more often than not. That's why I wanted to differentiate X and Y, but I don't mind if we release just X ... if that's so important to people. BTW Mike, Eclipse's releases are like Lucene's, and in fact I don't know of so many projects that just release X ... many of them seem to release X.Y. I don't understand why we're treating this as an all-or-nothing thing. We can let go of API back-compat, which clearly has no effect on index structure and content. We can even let go of index runtime changes for all I care. But I simply don't think we can let go of index structure back-support. Shai On Thu, Apr 15, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote: 2010/4/15 Shai Erera ser...@gmail.com: One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former.
I prefer X.Y, ie, changes to Y only is a minor release (mostly bug fixes but maybe small features); changes to X is a major release. I think that's more standard, ie, people will generally grok that 3.3 - 4.0 is a major change but 3.3 - 3.4 isn't. So this proposal would change how Lucene releases are numbered. Ie, the next release would be 4.0. Bug fixes / small features would then be 4.1. Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. No... in the proposal, you must re-index on upgrading to the next major release (3.x - 4.0). I think supporting old indexes, badly (what we do today) is not a great solution. EG on upgrading to 3.1 you'll immediately see a search perf hit since the flex emulation layer is running. It's a trap. It's this freedom, I think, that'd let us drop Version entirely. It's the back-compat of the index that is the major driver for having Version today (eg so that the analyzers can produce tokens matching your old index). EG Terrier seems
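The Version mechanism Mike mentions -- analyzers producing tokens that still match an old index -- follows a gating pattern that can be sketched like this. This is a toy model: the enum constants, the normalize method, and the lowercasing change are all invented for illustration and are not Lucene's real API or analyzers:

```java
import java.util.Locale;

public class VersionGate {
    // Illustrative version constants, ordered oldest to newest.
    public enum Ver { LUCENE_29, LUCENE_30 }

    // A hypothetical analysis fix gated on the version the application
    // was built against, so tokens keep matching an old index.
    public static String normalize(Ver matchVersion, String token) {
        if (matchVersion.compareTo(Ver.LUCENE_30) >= 0) {
            return token.toLowerCase(Locale.ROOT); // new, corrected behavior
        }
        return token; // legacy behavior preserved for old indexes
    }
}
```

Dropping index back-compat across major releases is what would make this kind of version parameter unnecessary: with a mandatory reindex, there is no old index whose tokens must be matched.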
Re: Proposal about Version API relaxation
Thanks Danil - you reminded me of another reason why reindexing is impossible - fetching the data, even if it's available, is too damn costly. Robert, I think you're driven by Analyzer changes ... been too much around them, I'm afraid :). A major version upgrade is a move to Java 1.5, for example. I can do that, and I don't see why I need to reindex my data because of that. And I simply don't buy that do this work on your own ... people can take a snapshot of the code, maintain it separately, and you'll never hear back from them. Who benefits - neither! It's open source - true, but it's way past the Hey look, I'm a new open source project w/ a dozen users, I can do whatever I want. Lucene is a respected open source project, w/ serious adoption and deployments. People trust the select few committers here to do it right for them, so they don't need to invest the time and resources in developing core IR stuff. And now you're pushing a do it yourself approach? I simply don't get or buy it. When were you stuck w/ maintaining a backwards change because the index structure changed? I bet not so many of us, or shall I say just the few Mikes out there? So how hard is it to require such back-compat support? I wholeheartedly agree that we shouldn't keep back-compat on Analyzer changes, nor on bugs such as the one which changed the position of the field from -1 to 0 (a while ago - don't remember the exact details). Shai On Thu, Apr 15, 2010 at 3:17 PM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view.
I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I can live w/ that Earwin ... I prefer the ongoing upgrades still, but I won't hold off the back-compat policy change vote because of that. Shai On Thu, Apr 15, 2010 at 3:30 PM, Earwin Burrfoot ear...@gmail.com wrote: I think an index upgrade tool is okay? While you still definetly have to code it, things like if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp are kept away from lucene innards and we all profit? On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. 
Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Well ... I could argue that it's you who miss the point :). I completely don't buy the all the new features comment -- how many new features in a major release force you to consider reindexing? Yet there are many of them that change the API. How will I know whether a release supports my index or not? Why do I need to work hard to back-port all the newly developed issues onto a branch I use? How many of those branches will exist? Will they all run nightly unit tests? Can I cut a release of such a branch myself? Or will I need the PMC or a VOTE? This will get complicated pretty fast ... Lucene is not a do it yourself kit - we try so hard to have the best defaults, best out of the box experience ... best everything for our users. Even w/ Analyzers we try so damn hard. While we could have simply componentized everything and told the users you can use those filters, tokenizers, segment mergers, policies etc. to make up your indexing application ... And I don't think there are features out there that exist and are not contributed because people are afraid of the index format changes ... obviously if they have done it, they're past the fear of handling index format ... I'd like to hear of one such feature. I'd bet there are such out there that are not contributed for IP, business and laziness reasons. Shai On Thu, Apr 15, 2010 at 3:56 PM, Robert Muir rcm...@gmail.com wrote: I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion. Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, but requires you to reindex and change your app? If its too difficult to reindex, it doesnt break your app that features exist elsewhere that you cannot access.
Its the same as it is today, there are features you cannot access, except they do not even exist in apache SVN at all, even trunk, because of these problems. On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. 
It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857388#action_12857388 ] Shai Erera commented on LUCENE-2396: Robert, I think this is great! Can we move more analyzers from core here? I think, however, that a backwards section in CHANGES is important because it alerts users about those analyzers whose runtime behavior changed. Otherwise how would the poor users know that? It doesn't mean you need to maintain back-compat support, but at least alert them when things change. Even if we eventually decide to remove API bw completely, a section in CHANGES will still be required to help users upgrade easily. remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2396.patch Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. if you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be instead, or move it all to core. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857396#action_12857396 ] Shai Erera commented on LUCENE-2396: Static? Weren't you against that!? But if we remove back compat from analyzers, why do we need Version? Or is it the API bw that we remove?
Re: Proposal about Version API relaxation
I seriously don't understand the fuss around index format back compat. How many times has this changed, such that it is too much to ask that X support X-1? I prefer to have ongoing segment merging, but can live w/ a manual converter tool. Thing is - I'll probably not be able to develop one myself outside the scope of Lucene, because I'll miss tons of API. So having Lucene declare it and adhere to it seems reasonable to me. BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration of the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written, whether for the manual or the automatic case, and probably by the same developer who changed the index format. It's the difference of when it happens. And I also think that a manual migration tool will need access to some lower-level API which is not exposed today, and will generally not perform as well as online migration. But that's a side note... Shai On Thursday, April 15, 2010, Earwin Burrfoot ear...@gmail.com wrote: I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches. On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index.
And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing already optimized index is a noop 2. it merges all your segments. if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in 3-rd world countries. On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. 
Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that times the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll in the worst case need to migrate a few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. 
At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when it happens. Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
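The migrate()-touches-only-old-segments idea discussed above can be sketched with a toy model. Everything here (Segment, FORMAT_CURRENT, the migrate method itself) is an illustrative stand-in, not Lucene's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed migrate(): rewrite only segments whose on-disk
// format is older than the current one, leaving already-current segments
// untouched -- unlike optimize(), which merges everything.
public class MigrateSketch {
    static final int FORMAT_CURRENT = 4; // hypothetical current format version

    static class Segment {
        final String name;
        int format;
        Segment(String name, int format) { this.name = name; this.format = format; }
    }

    // Returns the number of segments that actually had to be rewritten.
    static int migrate(List<Segment> segments) {
        int rewritten = 0;
        for (Segment s : segments) {
            if (s.format < FORMAT_CURRENT) {
                s.format = FORMAT_CURRENT; // stand-in for rewriting the segment's files
                rewritten++;
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("_0", 3)); // old-format segment, never merged since upgrade
        index.add(new Segment("_1", 4)); // already migrated by ordinary merging
        index.add(new Segment("_2", 4));
        System.out.println(migrate(index)); // only _0 is touched
    }
}
```

The point being argued: in a long-lived index most segments get rewritten by ordinary merging anyway, so a migrate() like this typically touches only the few large, old segments.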
Re: Proposal about Version API relaxation
+1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and see what happens'. But I think we need to at least decide what we're going to do so it's clear to everyone. Because I'd like to know if I'm about to propose an index format change, whether I need to build migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. But +1 for changing something ! Analyzers at first, API second. Shai On Thu, Apr 15, 2010 at 10:52 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote: for now simply moving analyzers to its own jar filE would be a great step! +1 -- why not consolidate all analyzers now? (And fix indexer to require a minimal API = TokenStream minus reset close). Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
Grant ... you've made it - the 100th response to that thread. Do we keep records somewhere? :) Ok I'm simply proposing to define 'index back-compat' as index format back-compat. With that, we don't 'wait' for something to happen, we just say up front that if that changes, we provide a migration tool for the latest index format version. Simple as that. The rest, we can 'see what happens' ... Shai On Thu, Apr 15, 2010 at 11:29 PM, Grant Ingersoll gsing...@apache.orgwrote: On Apr 15, 2010, at 4:21 PM, Shai Erera wrote: +1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and see what happens'. But I think we need to at least decide what we're going to do so it's clear to everyone. Because I'd like to know if I'm about to propose an index format change, whether I need to build migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. As I said, we should strive for index compatibility, but even in the past, we said we did, but the implications weren't always clear. I think index compatibility is very important. I've seen plenty of times where reindexing is not possible. But even then, you still have the option of testing to find out whether you can update or not. If you can't update, then don't until you can figure out how to do it. FWIW, I think our approach is much more proactive than see what happens. 
I'd argue that in the past, our approach was see what happens, only the seeing didn't happen until after the release! -Grant
Re: Proposal about Version API relaxation
Robert ... I'm sorry but changes to Analyzers don't *force* people to reindex. They can simply choose not to use the latest version. They can choose not to upgrade a Unicode version. They can copy the entire Analyzer code to match their needs. Index format changes is what I'm worried about because that *forces* people to reindex. Analyzers, believe it or not, are just a tool, an out of the box tool even, we're giving users to analyze their stuff. Probably a tool used by most of our users, but not all. Some have their own tools, that are currently wrapped as a Lucene Analyzer just because the API mandates. But we were talking about that too recently no? Ripping Analyzer off IndexWriter? Just to be clear - I think your work on Analyzers is fantastic ! Really ! Seriously ! But it's a choice someone can make ... whereas index format is a given - you have to live with it, or never upgrade Lucene. But I think we've chewed that way too much. I am all for removing bw on Analyzers, and 2396 is a great step towards it (or maybe it is IT?). Even index format - I don't see when it will change next (but I think I have an idea ...), so we can tackle it then. Shai On Thu, Apr 15, 2010 at 11:33 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote: Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. just look at the 1.8MB of backwards compat code in contrib/analyzers i want to remove in LUCENE-2396? are you serious? I wrote most of that cruft to prevent reindexing and you are trying to say I don't understand the fuss about it? 
We shouldn't make people reindex, but we should have the chance, even if we only do it ONE TIME, to reset Lucene to a new Major Version that has a bunch of stuff fixed we couldn't fix before, and more flexibility. Because with the current policy, it's like we are in 1.x forever, and our version numbers are a joke! -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
By all means Robert ... by all means :). Remember who started that thread, and for what reason :D. Shai On Fri, Apr 16, 2010 at 12:01 AM, Robert Muir rcm...@gmail.com wrote: If you really believe this. then you have no problem if i remove all Version from all core and contrib analyzers right now. On Thu, Apr 15, 2010 at 4:50 PM, Shai Erera ser...@gmail.com wrote: Robert ... I'm sorry but changes to Analyzers don't *force* people to reindex. They can simply choose not to use the latest version. They can choose not to upgrade a Unicode version. They can copy the entire Analyzer code to match their needs. Index format changes is what I'm worried about because that *forces* people to reindex. Analyzers, believe it or not, are just a tool, an out of the box tool even, we're giving users to analyze their stuff. Probably a tool used by most of our users, but not all. Some have their own tools, that are currently wrapped as a Lucene Analyzer just because the API mandates. But we were talking about that too recently no? Ripping Analyzer off IndexWriter? Just to be clear - I think your work on Analyzers is fantastic ! Really ! Seriously ! But it's a choice someone can make ... whereas index format is a given - you have to live with it, or never upgrade Lucene. But I think we've chewed that way too much. I am all for removing bw on Analyzers, and 2396 is a great step towards it (or maybe it is IT?). Even index format - I don't see when it will change next (but I think I have an idea ...), so we can tackle it then. Shai On Thu, Apr 15, 2010 at 11:33 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote: Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. 
just look at the 1.8MB of backwards compat code in contrib/analyzers i want to remove in LUCENE-2396? are you serious? I wrote most of that cruft to prevent reindexing and you are trying to say I don't understand the fuss about it? We shouldnt make people reindex, but we should have the chance, even if we only do it ONE TIME, to reset Lucene to a new Major Version that has a bunch of stuff fixed we couldnt fix before, and more flexibility. because with the current policy, its like we are in 1.x forever our version numbers are a joke! -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
DM I think ICU is great. But currently we use JFlex and you can run Java 10 if you want, but as long as JFlex is compiled w/ Java 1.4, that's what you'll get. Luckily Uwe and Robert recently bumped it up to Java 1.5. Such a change should be clearly documented in CHANGES so people are aware of this, and at least until they figure out what they want to do with it, they should take the pre-3.1 analyzers (assuming that's the next release w/ JFlex 1.5 tokenizers) and use them. Alternatively, we can think of writing an ICU analyzer/tokenizer, but we're still using JFlex, so I don't know how much control we have on that ... Shai On Fri, Apr 16, 2010 at 12:21 AM, DM Smith dmsmith...@gmail.com wrote: On Apr 15, 2010, at 4:50 PM, Shai Erera wrote: Robert ... I'm sorry but changes to Analyzers don't *force* people to reindex. They can simply choose not to use the latest version. They can choose not to upgrade a Unicode version. They can copy the entire Analyzer code to match their needs. Index format changes is what I'm worried about because that *forces* people to reindex. In several threads and issues it has been pointed out that upgrading Unicode versions is not an obvious choice or even controllable. It is dictated by the version of Java, the version of the OS and any Unicode specific libraries. A desktop application which internally uses lucene has no control over the automatic update of Java (yes it can detect the version change and refuse to run or force an upgrade) or when the user feels like upgrading the OS (not sure how to detect the Unicode version of an arbitrary OS. Not sure I want to). Even with server applications, some shared servers have one version of Java that all use. And the owner of an individual application might have no say in if or when that is upgraded. This is to say that one needs to be ready to re-index at all times unless it can be controlled. One way to handle the Java/Unicode is to use ICU at a specific version and control its upgrade. 
One way to handle the OS problem (which really is one of user input) is to keep up with the changes to Unicode and create a filter that handles the differences, normalizing to the Unicode version of the index (if that's even possible). Still goes to your point. The onus is on the application, not on Lucene. -- DM
[jira] Created: (LUCENE-2397) SnapshotDeletionPolicy.snapshot() throws NPE if no commits happened
SnapshotDeletionPolicy.snapshot() throws NPE if no commits happened --- Key: LUCENE-2397 URL: https://issues.apache.org/jira/browse/LUCENE-2397 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 SDP throws NPE if no commits occurred and snapshot() was called. I will replace it w/ throwing IllegalStateException. I'll also move TestSDP from o.a.l to o.a.l.index. I'll post a patch soon. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
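The proposed fix amounts to a guard at the top of snapshot() that fails fast with a descriptive IllegalStateException instead of an opaque NPE. A minimal self-contained sketch (the field and class names are illustrative, not SDP's real internals):

```java
// Sketch of the guard proposed for SnapshotDeletionPolicy.snapshot():
// throw IllegalStateException with a clear message when no commit has
// happened yet. Names here are illustrative stand-ins.
public class SnapshotGuardSketch {
    private Object lastCommit; // null until the deletion policy sees a commit

    public Object snapshot() {
        if (lastCommit == null) {
            throw new IllegalStateException(
                "no index commit to snapshot; call IndexWriter.commit() first");
        }
        return lastCommit;
    }

    public static void main(String[] args) {
        SnapshotGuardSketch sdp = new SnapshotGuardSketch();
        try {
            sdp.snapshot(); // no commit recorded -> descriptive ISE, not NPE
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```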
Re: Build failed in Hudson: Lucene-trunk #1157
DB jars again ... I think this one is a false alarm. Shai On Fri, Apr 16, 2010 at 5:14 AM, Apache Hudson Server hud...@hudson.zones.apache.org wrote: See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1157/changes Changes: [mikemccand] speed up TestStressIndexing2 -- [...truncated 4473 lines...] jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-demo: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo [javac] Compiling 17 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. compile-memory: [echo] Building memory... common.init: build-lucene: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/classes/java [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/classes/java [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/lucene-memory-2010-04-16_02-03-48.jar default: compile-highlighter: [echo] Building highlighter... build-memory: build-queries: [echo] Highlighter building dependency contrib/queries [echo] Building queries... 
common.init: build-lucene: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/classes/java [javac] Compiling 18 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/lucene-queries-2010-04-16_02-03-48.jar default: common.init: build-lucene: init: clover.setup: clover.info: clover: common.compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java [javac] Compiling 35 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. compile-core: jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/lucene-highlighter-2010-04-16_02-03-48.jar default: compile-analyzers-common: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java [javac] Compiling 106 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java [javac] Note: Some input files use or override a deprecated API. 
[javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/quality/trec/TrecJudge.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/lucene-benchmark-2010-04-16_02-03-48.jar jar: compile-test: [echo] Building benchmark... common.init: compile-demo: jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup:
[jira] Resolved: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2316. Lucene Fields: [New, Patch Available] (was: [New]) Assignee: Shai Erera Resolution: Fixed Committed revision 933879. Define clear semantics for Directory.fileLength --- Key: LUCENE-2316 URL: https://issues.apache.org/jira/browse/LUCENE-2316 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 Attachments: LUCENE-2316.patch On this thread: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e it was mentioned that Directory's fileLength behavior is not consistent between Directory implementations if the given file name does not exist. FSDirectory returns a 0 length while RAMDirectory throws FNFE. The problem is that the semantics of fileLength() are not defined. As proposed in the thread, we'll define the following semantics: * Returns the length of the file denoted by name if the file exists. The return value may be anything between 0 and Long.MAX_VALUE. * Throws FileNotFoundException if the file does not exist. Note that you can call dir.fileExists(name) if you are not sure whether the file exists or not. For backwards compatibility we'll create a new method w/ clear semantics. Something like:

{code}
/**
 * @deprecated the method will become abstract when #fileLength(name) has been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}

The first line just calls the current impl. If it throws an exception for a non-existing file, we're ok. The second line verifies whether a 0 length is for an existing file or not and throws an exception appropriately. -- This message is automatically generated by JIRA. 
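The proposed semantics can be exercised with plain java.io, independently of Lucene. This sketch mimics the shim above on top of java.io.File (whose length() also returns 0 for a missing file, just like FSDirectory did):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch of the proposed fileLength semantics: return the length for an
// existing file, throw FileNotFoundException for a missing one -- never
// silently return 0 for a file that doesn't exist.
public class FileLengthSketch {
    static long getFileLength(File dir, String name) throws IOException {
        File f = new File(dir, name);
        long len = f.length();          // java.io.File.length() is 0 for missing files
        if (len == 0 && !f.exists()) {  // disambiguate "empty file" from "no such file"
            throw new FileNotFoundException(name);
        }
        return len;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        try {
            getFileLength(dir, "no-such-file-xyz.bin"); // hypothetical missing file
        } catch (FileNotFoundException e) {
            System.out.println("FNFE: " + e.getMessage());
        }
    }
}
```

The `len == 0 && !exists` check is the key trick: a length of 0 alone is ambiguous, so only the combination proves the file is absent.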
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856845#action_12856845 ] Shai Erera commented on LUCENE-2159: This looks like a nice tool. But all it does is create multiple copies of the same segment(s) right? So what exactly do you want to test with it? What worries me is that we'll be multiplying the lexicon, posting lists, statistics etc., therefore I'm not sure how reliable the tests will be (whatever they are), except for measuring things related to large numbers of segments (like merge performance). Am I right? I also think this class better fits in benchmark rather than misc, as it's really for perf. testing/measurements and not as a generic utility ... You can create a Task out of it, like ExpandIndexTask, which one can include in their algorithm. Tool to expand the index for perf/stress testing. - Key: LUCENE-2159 URL: https://issues.apache.org/jira/browse/LUCENE-2159 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 3.0 Reporter: John Wang Attachments: ExpandIndex.java Sometimes it is useful to take a small-ish index and expand it into a large index with K segments for perf/stress testing. This tool does that. See attached class. -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856877#action_12856877 ] Shai Erera commented on LUCENE-2159: bq. I understand having a general performance suite to test regression is a good thing. But we found having a more focused test for segmentation and merge is important. Are you saying that because of the benchmark proposal? I still think that an ExpandIndexTask will be useful for benchmark and fits better there, than in contrib/misc. We can have that task together w/ a predefined .alg for using it ...
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856911#action_12856911 ] Shai Erera commented on LUCENE-2159: Which is fine - I think this would be a neat task to add to benchmark, w/ specific documentation on how to use it and for what purposes. If you can also write a sample .alg file which e.g. creates a small index and then expands it, that'd be great. I've looked at the different PerfTask implementations in benchmark, and I'm thinking if we perhaps should do the following: * Create an AddIndexesTask which receives one or more Directories as input and calls writer.addIndexesNoOptimize * If one wants, he can add an OptimizeTask call afterwards. * Write an expandIndex.alg which initially creates an index of size N from one content source and then calls the AddIndexesTask several times. The .alg file is meant to be an example as well, as people can change it to create bigger or smaller indexes, use other content sources and switch between RAM/FS directories. How's that sound?
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856917#action_12856917 ] Shai Erera commented on LUCENE-2159: bq. There is an excellent section on it in LIA2 Indeed ! Ok so to create a task, you just extend PerfTask. You can look under contrib/benchmark/src/java/o.a.l/benchmark/byTask/tasks for many examples. OptimizeTask seems relevant here (i.e. it calls an IW API and receives a parameter). For writing .alg files, that's SUPER simple, just look under contrib/benchmark/conf for many existing examples. You can post a patch once you feel comfortable enough with it and I can help you with the struggles (if you'll run into any). Another great source (besides LIA2) on writing .alg files is the package.html under contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask.
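The extend-PerfTask pattern discussed in this thread can be illustrated with a simplified, self-contained stand-in. The PerfTask base class here is a toy mirroring the shape of benchmark's real one (which lives under contrib/benchmark/.../byTask/tasks), and AddIndexesTask is the hypothetical task proposed above, with a List of segment names standing in for an actual index:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for contrib/benchmark's PerfTask pattern: a task
// overrides doLogic() and returns a work count. Not the real API.
public class PerfTaskSketch {
    abstract static class PerfTask {
        abstract int doLogic() throws Exception; // returns # of work items done
    }

    // Hypothetical AddIndexesTask: append copies of a "source index"
    // (here just segment names) onto a target, mimicking the proposed
    // writer.addIndexesNoOptimize(dirs) call.
    static class AddIndexesTask extends PerfTask {
        final List<String> target, source;
        AddIndexesTask(List<String> target, List<String> source) {
            this.target = target;
            this.source = source;
        }
        @Override
        int doLogic() {
            target.addAll(source); // stand-in for the actual addIndexes call
            return source.size();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> target = new ArrayList<>(Arrays.asList("_0"));
        PerfTask task = new AddIndexesTask(target, Arrays.asList("_a", "_b"));
        task.doLogic(); // run once: 3 segments
        task.doLogic(); // run again: 5 segments -- the "expand" effect
        System.out.println(target.size());
    }
}
```

Running the task repeatedly from an .alg file is what would give the index-expansion effect LUCENE-2159 is after.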
Re: Proposal about Version API relaxation
Ahh ... a dream finally comes true ... what a great way to start a day :). +1 !!! I have some questions/comments though: * Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their segments when they move from 2.x to 3.x before 4.0 lands and they'll need to call optimize() to ensure 4.0 still works on their index. I hope that will still be the case? Otherwise I don't see how we can prevent reindexing by apps. ** Index behavioral/runtime changes, like those of Analyzers, are ok to require a reindex, as proposed. So after 3.1 is out, trunk can break the API and 3.2 will have a new set of API? Cool and convenient. For how long do we keep the 3.1 branch around? Also, it used to only fix bugs, but from now on it'll be allowed to introduce new features, if they maintain back-compat? So 3.1.1 can have 'flex' (going for the extreme on purpose) if someone maintains back-compat? I think the back-compat on branches should be only for index runtime changes. There's no point, in my opinion, to maintain API back-compat anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1 just to get a new feature but get it API back-supported? As soon as they upgrade to 3.2, that means a new set of API right? Major releases will just change the index structure format then? Or move to Java 1.6? Well ... not even that because as I understand it, 3.2 can move to Java 1.6 ... no API back-compat right :). That's definitely a great step forward ! Shai On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote: On Thu, 15 Apr 2010, Earwin Burrfoot wrote: Can't believe my eyes. +1 Likewise. +1 ! Andi.. 
On Thu, Apr 15, 2010 at 01:22, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote: Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file. So... what if we change up how we develop and release Lucene: * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.). The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and, if any devs have the itch, they could freely back-port improvements from trunk as long as they kept back-compat within the branch. I think in such a future world, we could: * Remove Version entirely! * Not worry at all about back-compat when developing on trunk * Give proper names to new improved classes instead of StandardAnalzyer2, or SmartStandardAnalyzer, that we end up doing today; rename existing classes. * Let analyzers freely, incrementally improve * Use interfaces without fear * Stop spending the truly substantial time (look @ Uwe's awesome back-compat layer for analyzers!) that we now must spend when adding new features, for back-compat * Be more free to introduce very new not-fully-baked features/APIs, marked as experimental, on the expectation that once they are used (in trunk) they will iterate/change/improve vs trying so hard to get things right on the first go for fear of future back compat horrors. Thoughts...? 
Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
Also, we will still need to maintain the Backwards section in CHANGES (or move it to API Changes), to help people upgrade from release to release. Just pointing that out as well. Shai On Thu, Apr 15, 2010 at 7:05 AM, Shai Erera ser...@gmail.com wrote:
Re: Proposal about Version API relaxation
So then I don't understand this: {quote} * A major release always bumps the major release number (2.x -> 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release, right? Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor as X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote:
[jira] Resolved: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2386. Resolution: Fixed Committed revision 933613. (take #2) IndexWriter commits unnecessarily on fresh Directory Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) back :). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Proposal about Version API relaxation
Hi I'd like to propose a relaxation on the Version API. Uwe, please read the entire email before you reply :). I was thinking, following a question on the user list, that the Version-based API may not be very intuitive to users, especially those who don't care about versioning, as well as very inconvenient. So there are two issues here: 1) How should one use Version smartly so that he keeps backwards compatibility. I think we all know the answer, but a Wiki page with some best practices tips would really help users use it. 2) How can one write sane code, which doesn't pass versions all over the place if: (1) he doesn't care about versions, or (2) he cares, and sets the Version to the same value in his app, in all places. Also, I think that today we offer a flexibility to users, to set different Versions on different objects in the life span of their application - which is a good flexibility but can also lead people to shoot themselves in the legs if they're not careful -- e.g. upgrading Version across their app, but failing to do so for one or two places ... So the change I'd like to propose is to mostly alleviate (2) and better protect users - I DO NOT PROPOSE TO GET RID OF Version :). I was thinking that we can add on Version a DEFAULT version, which the caller can set. So Version.setDefault and Version.getDefault will be added, as static members (more on the static-ness of it later). We then change the API which requires Version to also expose an API which doesn't require it, and that API will call Version.getDefault(). People can use it if they want to ... Few points: 1) As a default DEFAULT Version is controversial, I don't want to propose it, even though I think Lucene can define the DEFAULT to be the latest. Instead, I propose that Version.getDefault throw a DefaultVersionNotSetException if it wasn't set, while an API which relies on the default Version is called (I don't want to return null, not sure how safe it is). 
2) That DEFAULT Version is static, which means it will affect all indexing code running inside the JVM. Which is fine: 2.1) Perhaps all the indexing code should use the same Version 2.2) If you know that's not the case, then pass Version to the API which requires it - you cannot use the 'default Version' API -- nothing changes for you. One case is missing -- you might not know if your code is the only indexing code which runs in the JVM ... I don't have a solution to that, but I think it'll be revealed pretty quickly, and you can change your code then ... So to summarize - the current Version API will remain and people can still use it. The DEFAULT Version API is meant for convenience for those who don't want to pass Version everywhere, for the reasons I outlined above. This will also clean our test code significantly, as the tests will set the DEFAULT version to TEST_VERSION_CURRENT at start ... The changes to the Version class will be very simple. If people think that's acceptable, I can open an issue and work on it. Shai
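The proposed mechanism can be sketched in a few lines. Everything here is hypothetical: setDefault/getDefault and the unset-default exception are the proposal, not released Lucene API, and SomeAnalyzer is a made-up stand-in for any Version-aware class:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the proposed app-settable default Version (hypothetical API).
enum Version {
    LUCENE_29, LUCENE_30;

    private static final AtomicReference<Version> DEFAULT = new AtomicReference<Version>();

    /** Sets the JVM-wide default used by the Version-less convenience APIs. */
    public static void setDefault(Version v) { DEFAULT.set(v); }

    /** Returns the default, or throws if the application never set one
     *  (the proposal calls this DefaultVersionNotSetException). */
    public static Version getDefault() {
        Version v = DEFAULT.get();
        if (v == null) {
            throw new IllegalStateException("default Version not set");
        }
        return v;
    }
}

class SomeAnalyzer {
    final Version matchVersion;

    SomeAnalyzer(Version v) { this.matchVersion = v; } // existing Version-aware ctor

    SomeAnalyzer() { this(Version.getDefault()); }     // proposed convenience ctor

    public static void main(String[] args) {
        boolean threw = false;
        try {
            new SomeAnalyzer();                        // no default set yet
        } catch (IllegalStateException e) {
            threw = true;
        }
        System.out.println("unset default throws: " + threw);

        Version.setDefault(Version.LUCENE_30);         // set once, app-wide
        System.out.println("ctor picks up default: "
                + (new SomeAnalyzer().matchVersion == Version.LUCENE_30));
    }
}
```

Note how this matches point 2.2 above: an explicitly passed Version always wins, and the static default only kicks in for callers that opt into the no-arg API.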
[jira] Updated: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2316: --- Attachment: LUCENE-2316.patch Patch clarifies the contract, fixes the directories to adhere to it and adds a CHANGES entry under the backwards section. All tests pass. Define clear semantics for Directory.fileLength --- Key: LUCENE-2316 URL: https://issues.apache.org/jira/browse/LUCENE-2316 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Priority: Minor Fix For: 3.1 Attachments: LUCENE-2316.patch On this thread: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e it was mentioned that Directory's fileLength behavior is not consistent between Directory implementations if the given file name does not exist. FSDirectory returns a 0 length while RAMDirectory throws FNFE. The problem is that the semantics of fileLength() are not defined. As proposed in the thread, we'll define the following semantics: * Returns the length of the file denoted by {{name}} if the file exists. The return value may be anything between 0 and Long.MAX_VALUE. * Throws FileNotFoundException if the file does not exist. Note that you can call dir.fileExists(name) if you are not sure whether the file exists or not. For backwards we'll create a new method w/ clear semantics. Something like:
{code}
/**
 * @deprecated this method will become abstract when #fileLength(name) has been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}
The first line just calls the current impl. If it throws an exception for a non-existing file, we're ok. The check afterwards verifies whether a 0 length is for an existing file or not and throws an exception appropriately.
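The proposed contract can be exercised with a toy, map-backed stand-in. MockDirectory and its fields are hypothetical, sketched only to show why the wrapper needs the fileExists() check: a 0 length may mean either "empty file" or, in lenient implementations, "file does not exist":

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a Directory; not the real Lucene class.
class MockDirectory {
    private final Map<String, Long> files = new HashMap<String, Long>();

    void createFile(String name, long length) { files.put(name, length); }

    boolean fileExists(String name) { return files.containsKey(name); }

    /** Old lenient behavior (as FSDirectory had): 0 for a missing file. */
    long fileLength(String name) {
        Long len = files.get(name);
        return len == null ? 0 : len;
    }

    /** Proposed semantics: length if the file exists, FNFE otherwise. */
    long getFileLength(String name) throws IOException {
        long len = fileLength(name);
        if (len == 0 && !fileExists(name)) {
            throw new FileNotFoundException(name);
        }
        return len;
    }

    public static void main(String[] args) throws IOException {
        MockDirectory dir = new MockDirectory();
        dir.createFile("_0.cfs", 42L);
        dir.createFile("empty", 0L);
        System.out.println(dir.getFileLength("_0.cfs")); // 42
        System.out.println(dir.getFileLength("empty"));  // 0 - file exists
        try {
            dir.getFileLength("missing");
        } catch (FileNotFoundException e) {
            System.out.println("missing file -> FNFE");
        }
    }
}
```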
Re: Proposal about Version API relaxation
Well the no-arg ctor will be using Version.getDefault() which will throw an exception if not set, and delegate the call to the Version-aware ctor. On Tuesday, April 13, 2010, Robert Muir rcm...@gmail.com wrote: On Tue, Apr 13, 2010 at 11:27 AM, Shai Erera ser...@gmail.com wrote: I was thinking that we can add on Version a DEFAULT version, which the caller can set. So Version.setDefault and Version.getDefault will be added, as static members (more on the static-ness of it later). We then change the API which requires Version to also expose an API which doesn't require it, and that API will call Version.getDefault(). People can use it if they want to ... I don't understand how this works... if Something has a no-arg ctor today, and i want to improve it in a backwards-compatible way, how will this work? the way this works today, lets say while working with 3.1 is: Something() is deprecated, and invokes Something(3.0). Something(Version) is added, and emulates the old behavior for < 3.1, and the new behavior for >= 3.1. i dont see how backwards compatibility will work with this proposal, since the no-arg ctor would then emulate some random behavior depending on a static. -- Robert Muir rcm...@gmail.com
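The back-compat idiom Robert describes can be sketched like this (Something and its behavior flag are made up, and a plain int stands in for Version to keep the snippet self-contained): the Version-aware ctor gates old vs. new behavior, the deprecated no-arg ctor pins the old one, and under the proposal the no-arg ctor would instead delegate to the app-set default.

```java
// Hypothetical class illustrating today's Version-gated back-compat pattern.
class Something {
    final boolean newTokenization; // made-up behavioral switch

    /** Version-aware ctor; 31 plays the role of Version.LUCENE_31. */
    Something(int matchVersion) {
        this.newTokenization = matchVersion >= 31; // old behavior for < 3.1
    }

    /** Pre-Version ctor: deprecated, pinned to the old (3.0) behavior. */
    @Deprecated
    Something() { this(30); }

    public static void main(String[] args) {
        System.out.println("no-arg (old): " + new Something().newTokenization);
        System.out.println("3.1 (new): " + new Something(31).newTokenization);
    }
}
```

Robert's objection is that replacing the deprecated ctor's fixed `this(30)` with `this(Version.getDefault())` would make its behavior depend on whatever the static default happens to be.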
Re: Proposal about Version API relaxation
That is a static default! Yes Uwe ... I'm aware of that :) But that's not a static default for Lucene ... only for the application, if it chooses to use it ... so there are no plans to reimplement such a thing again Well ... that's not exactly what I'm proposing here. I'm not for re-implementing any sort of staticness, unless the app chooses to use it. And please don't give me that 'there are no plans ...' answer - it kind of kills the discussion, which is not healthy for a community. I agree that static variables might cause troubles to some deployments, BUT: 1) Not all apps are deployed on a Web Server together with other apps who happen to use Lucene. 2) Those that are deployed on web servers usually include lucene.jar in their classpath and are loaded by a different class loader than the rest ... So we're really talking about deployments where Lucene is a common, shared library between all apps ... And I guess that what bothers me the most is that it feels to me like we're trying to protect people from stuff we haven't yet received complaints on (at least none that I'm aware of), while we're hurting the programming experience of others ... almost recklessly. I'd hope we can find a way around that, because today I pass the same Version value around everywhere, and it's simply inconvenient. Just yesterday people complained about the need to call writer.commit() after new IW() if they want to open a reader ... one-liner inconvenience vs. dozen of lines here -- point is, what's perceived as unnecessary code DOES bother people ... only here it's just a setting thing, and my proposal is not to make it generically static. So let's not get caught on that 'static-ness'. And besides, if you ask me - variables like Version, that are needed in so many places, are usually made static ... but not in Lucene ... So if possible ... I'd like to think how we can fix/improve the use of Version, in ways that won't break apps. 
Because the fact of the matter is - we invented Version to allow for changes w/o breaking back-compat, while the backwards section in CHANGES seems to grow from release to release (I know - I'm partly to blame for it :)), and another fact is that I don't remember even one complaint about a change which broke back-compat. People have raised this issue numerous times in the past, even proposed all sorts of contracts and definitions on how we can be 'allowed' to break back-compat ... but nothing came out of it. The fact that we are not able to reach consensus doesn't mean the problem doesn't bother many out there. And ignoring the fact that currently the API looks cluttered is not doing any good. There must be a way to allow some apps out there (IMO the majority) to set that Version thing once, and let Lucene use that value everywhere else ... while for others to pass it along as much as they want. Shai On Tue, Apr 13, 2010 at 7:41 PM, Uwe Schindler u...@thetaphi.de wrote: Hi Shai, one of the problems I have is: That is a static default! We want to get rid of them (and did it mostly, only some relicts remain), so there are no plans to reimplement such a thing again. The worst one is BooleanQuery.maxClauseCount. The same applies to all types of sysprops. As Lucene and Solr are mostly running in servlet containers, this type of thing makes web applications no longer isolated. This is also a general contract for libraries: never ever rely on sysprops or statics. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
Re: Proposal about Version API relaxation
Because the version mechanism is not a single value for the entire library but rather feature by feature. I don't see how a global setter can help. That's only true if we believe people use different Version values in different places of their code ... and note that they will still be able to. I'm not proposing to take out Version from the ctors, just to add an additional default-version the app can set and use. So if the app doesn't want to do it .. it doesn't have to. Shai On Tue, Apr 13, 2010 at 9:40 PM, DM Smith dmsmith...@gmail.com wrote: I like the concept of version, but I'm concerned about it too. The current Version mechanism allows one to use more than one Version in their code. Imagine that we are at 3.2 and one was unable to upgrade to the most recent version for a particular feature. Let's also suppose that at 3.2 a new feature was introduced and was taken advantage of. But at 3.5 that new feature is versioned but one is unable to upgrade for it, too. Now what? Use 3.0 for the one feature and 3.2 for the other? What about the interoperability of versioned features? Does a version 3.0 class play well with a 3.2 versioned class? How do we test that? A long term issue is that of bw compat for the version itself. The bw compat contract is two fold: API and index. The API has a shorter lifetime of compatibility than that of an index. How does one deprecate a particular version for the api but not the index? How does one know whether one versioned feature impacts the index and another does not? I'm hoping that I'm imagining a problem that will never actually arise. Shai, to your suggestion: Because the version mechanism is not a single value for the entire library but rather feature by feature. I don't see how a global setter can help. -- DM On 04/13/2010 11:27 AM, Shai Erera wrote:
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855870#action_12855870 ] Shai Erera commented on LUCENE-2386: I'm not sure if we're arguing about the same thing here ... why when I open an IW on empty Directory I need an empty segment that's created, and from now on never changed, populated or even read? That just seems wrong to me ... when I fixed the tests to not rely on the buggy behavior, I noticed several which count the list of commits (especially the IDP ones) w/ a documentation like 1 for opening + N for committing ... It just looks weird that when you open IW a commit happens, a set of empty files are created, but from now on they are never modified, until IDP kicks in, after the second commit ... it's nothing like initing the Directory to be able to receive input .. And I don't know what's the benefit of doing new IW() following by IR.open() ... that IR will always see 0 documents, until you call reopen (if commit happened in between). So what's the convenience here? that your code can call IR.open once, and from that point forward just 'reopen()'? That seems low advantage to me, really. Maybe what we should do is fix IR.open to return a null IR in case the directory hasn't been populated w/ anything yet. Then you can check easily if you should call open() (==null) or reopen (otherwise). Or create a blank stub of IR which emulates an empty Dir, and when reopen is called works well (if the Directory is not empty now) ... BTW, FWIW, Solr's code did not break from this change at all ... it was the combination of FSDir and NoLF/SingleInstanceLF that broke some tests that used it ... I don't know how many apps out there are using that combination, but I'd bet it's small? I use that combination, however in my case an IR is opened only after a commit signal/event is raised (so I don't check isCurrent often or attempt to reopen()). 
What I'm trying to say is that this combination is dangerous, and the application needs to ensure that only one IW is open at any given time, and I'm sure such apps are more sophisticated than opening IW and then IR just for the convenience of it.
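The "return null from open() on an empty directory" alternative floated in this comment can be sketched with toy classes (ToyDirectory/ToyReader are hypothetical stand-ins, not Lucene's Directory/IndexReader API):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: open() returns null when the directory holds no commits, so the
// caller can decide between a first open() and a later reopen() without
// hitting an exception. Illustrative only.
class ToyDirectory {
    final List<String> commits = new ArrayList<String>();
}

class ToyReader {
    final int commitCount;

    private ToyReader(int n) { this.commitCount = n; }

    /** Returns null instead of a reader over a commit-less directory. */
    static ToyReader open(ToyDirectory dir) {
        return dir.commits.isEmpty() ? null : new ToyReader(dir.commits.size());
    }

    public static void main(String[] args) {
        ToyDirectory dir = new ToyDirectory();
        System.out.println("before any commit: " + ToyReader.open(dir)); // null
        dir.commits.add("segments_1");
        System.out.println("after one commit: " + ToyReader.open(dir).commitCount);
    }
}
```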
[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855873#action_12855873 ] Shai Erera commented on LUCENE-2316: Well ... dir.fileLength is also used by SegmentInfos.sizeInBytes to compute the size of all the files in the Directory. If we remove fileLength, then SI will need to call dir.openInput(name).length() and then close it? Seems like a lot of work to me, just for obtaining the length of a file. So I agree that if you have an IndexInput at hand, you should call its length() method rather than Dir.fileLength. But otherwise, if you just have a name at hand, dir.fileLength is convenient? I'm also ok w/ the bw break rather than going through the new/deprecate cycle. Define clear semantics for Directory.fileLength --- Key: LUCENE-2316 URL: https://issues.apache.org/jira/browse/LUCENE-2316 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Priority: Minor Fix For: 3.1 On this thread: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e it was mentioned that Directory's fileLength behavior is not consistent between Directory implementations if the given file name does not exist. FSDirectory returns a 0 length while RAMDirectory throws FNFE. The problem is that the semantics of fileLength() are not defined. As proposed in the thread, we'll define the following semantics: * Returns the length of the file denoted by {{name}} if the file exists. The return value may be anything between 0 and Long.MAX_VALUE. * Throws FileNotFoundException if the file does not exist. Note that you can call dir.fileExists(name) if you are not sure whether the file exists or not. For backwards compatibility we'll create a new method w/ clear semantics. Something like:
{code}
/**
 * @deprecated this method will become abstract when #fileLength(name) has been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}
The first line just calls the current impl. If it throws an exception for a non-existing file, we're ok. The second line verifies whether a 0 length is for an existing file or not and throws an exception appropriately.
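To make the proposed semantics concrete, here is a hypothetical in-memory stand-in for Directory (not Lucene code; the class and all names are invented for illustration) showing how getFileLength layers the FileNotFoundException guarantee over a legacy fileLength that returns 0 for a missing file:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory "directory" illustrating the proposed semantics.
public class FileLengthDemo {
    private final Map<String, Long> files = new HashMap<>();

    public void addFile(String name, long length) { files.put(name, length); }

    public boolean fileExists(String name) { return files.containsKey(name); }

    // Legacy FSDirectory-like behavior: returns 0 if the file does not exist.
    public long fileLength(String name) {
        Long len = files.get(name);
        return len == null ? 0 : len;
    }

    // Proposed wrapper: throws FileNotFoundException for a missing file,
    // but still returns 0 for a file that exists and is empty.
    public long getFileLength(String name) throws IOException {
        long len = fileLength(name);
        if (len == 0 && !fileExists(name)) {
            throw new FileNotFoundException(name);
        }
        return len;
    }
}
```

Note the ambiguity the extra fileExists check resolves: a return value of 0 from the legacy method could mean either "empty file" or "no such file".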
[jira] Commented: (LUCENE-2392) Enable flexible scoring
[ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875 ] Shai Erera commented on LUCENE-2392: Mike - it'll also be great if we can store the length of the document in a custom way. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today? I think though that it's not a field-level setting, but an IW one? Enable flexible scoring --- Key: LUCENE-2392 URL: https://issues.apache.org/jira/browse/LUCENE-2392 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2392.patch This is a first step (nowhere near committable!), implementing the design iterated to in the recent "Baby steps towards making Lucene's scoring more flexible" java-dev thread. The idea is (if you turn it on for your Field; it's off by default) to store full stats in the index, into a new _X.sts file, per doc (X field) in the index. And then have FieldSimilarityProvider impls that compute a doc's boost bytes (norms) from these stats. The patch is able to index the stats, merge them when segments are merged, and provides an iterator-only API. It also has a starting point for per-field Sims that use the stats iterator API to compute boost bytes. But it's not at all tied into actual searching! There's still tons left to do, eg, how one configures via Field/FieldType which stats one wants indexed. All tests pass, and I added one new TestStats unit test. 
The stats I record now are:
* field's boost
* field's unique term count (a b c a a b -> 3)
* field's total term count (a b c a a b -> 6)
* total term count per-term (sum of total term count for all docs that have this term)
Still need at least the total term count for each field.
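For illustration only (not code from the patch; the class name is made up), the two per-field counts can be computed from a token list exactly as in the "a b c a a b" example above:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative computation of two of the per-document field stats
// listed above, using the issue's own "a b c a a b" example.
public class FieldStats {
    // Number of distinct terms in the field: a b c a a b -> 3
    public static int uniqueTermCount(String[] tokens) {
        Set<String> seen = new HashSet<>();
        for (String t : tokens) seen.add(t);
        return seen.size();
    }

    // Total number of term occurrences in the field: a b c a a b -> 6
    public static int totalTermCount(String[] tokens) {
        return tokens.length;
    }
}
```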
[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems
[ https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855877#action_12855877 ] Shai Erera commented on LUCENE-2373: I'd rather not count on file length as well ... so a put/getTermDictSize method on Codec will allow one to implement it however one wants, if running on HDFS for example? Change StandardTermsDictWriter to work with streaming and append-only filesystems - Key: LUCENE-2373 URL: https://issues.apache.org/jira/browse/LUCENE-2373 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Andrzej Bialecki Fix For: 3.1 Since early 2.x times Lucene has used a skip/seek/write trick to patch the length of the terms dict into a place near the start of the output data file. This, however, made it impossible to use Lucene with append-only filesystems such as HDFS. In the post-flex trunk the following code in StandardTermsDictWriter initiates this:
{code}
// Count indexed fields up front
CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT);
out.writeLong(0); // leave space for end index pointer
{code}
and completes it in close():
{code}
out.seek(CodecUtil.headerLength(CODEC_NAME));
out.writeLong(dirStart);
{code}
I propose to change this layout so that this pointer is stored simply at the end of the file. It's always 8 bytes long, and we know the final length of the file from Directory, so it's a single additional seek(length - 8) to read it, which is not much considering the benefits.
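As a rough sketch of the proposed layout (not the actual codec code; the class and method names here are invented), the footer trick boils down to appending the dirStart pointer as the last 8 bytes of the file and reading it back with a single seek(length - 8):

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the proposed append-only layout: no seek-back to patch a
// header field, the end index pointer is simply the file's last 8 bytes.
public class EndPointerDemo {
    public static void write(File f, byte[] body, long dirStart) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.write(body);          // terms dict data, written append-only
            out.writeLong(dirStart);  // end index pointer, always last 8 bytes
        }
    }

    public static long readDirStart(File f) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(in.length() - 8); // single extra seek to the footer
            return in.readLong();
        }
    }
}
```

Nothing is ever rewritten in place, which is the property an append-only filesystem like HDFS requires.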
Re: [jira] Commented: (LUCENE-2392) Enable flexible scoring
I'm not sure, Robert, where I proposed to shove random statistics into the index? Lucene computes a doc length today, and some in academia/research disagree w/ how it's done. So instead of attempting to fix it for everyone, I think it'd be great if one could define what the doc length is as one perceives it. Why is that problematic? What Mike opened is an issue titled "enable flexible scoring" ... what I'm asking for falls under that hood? Also, maybe we should have that discussion on the issue? Shai On Mon, Apr 12, 2010 at 11:31 AM, Robert Muir rcm...@gmail.com wrote: I disagree. I think what Mike has defined here is way beyond a baby-step: it's all the stats needed to support modern IR models in Lucene: BM25, additional vector space algorithms, divergence from randomness, and language modelling. I think the ability to calculate your own random statistics and shove them into the index (this would be messy -- like, how to get access to the aggregates you need anyway) is something different entirely, best left to research systems. You can't even do that with Terrier now. On Mon, Apr 12, 2010 at 3:35 AM, Shai Erera (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875 ] Shai Erera commented on LUCENE-2392: Mike - it'll also be great if we can store the length of the document in a custom way. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today? I think though that it's not a field-level setting, but an IW one? 
Enable flexible scoring --- Key: LUCENE-2392 URL: https://issues.apache.org/jira/browse/LUCENE-2392 -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855892#action_12855892 ] Shai Erera commented on LUCENE-2386: bq. what is the proper way (after this fix) to open an IR over a possibly-empty directory? You can simply call commit() immediately after you open IW. If that's what you need then it will work for you. You're right that if I add docs, delete them and then commit, I'll get an empty segment. The same is true if you do new IW() and then iw.close() w/ no addDocument in between. The point here was that we should not create a commit unless the user has specifically asked for it. Calling close() means asking for a commit, per close()'s semantics and contract. But if the app called new IW, added docs and crashed in the middle, the Directory will still remain empty ... which is, IMO, what should happen. I agree it's a matter of perspective. I think that when autoCommit was removed, so should this code have been. I don't know if it was left behind for a good reason, or simply because when someone tried to do it, he found out it's not that simple (like I have :)).
[jira] Commented: (LUCENE-2392) Enable flexible scoring
[ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855913#action_12855913 ] Shai Erera commented on LUCENE-2392: I'd like to withdraw my request from above. I had misunderstood: the stats I need are stored per-field per-doc, so that will allow me to compute the docLength as I want.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855924#action_12855924 ] Shai Erera commented on LUCENE-2386: I don't think that people need to write that emptiness-detection-then-commit code ... if they care, they can simply call commit() immediately after they open IW. bq. Isn't opening IW with CREATE* mode called specifically asking for? It depends on how you interpret the mode ... for example, you cannot pass OpenMode.APPEND for an empty Directory, because IW throws an exception. The modes are just meant to tell IW how to behave: * APPEND - I know there is an index in the Directory, and I'd like to append to it. * CREATE - I don't care if there is an index in the Directory -- create a new one, zeroing out all segments. * CREATE_OR_APPEND - If there is an index, open it, otherwise create a new one. So if you pass CREATE on an already populated index, IW doesn't do the implicit commit until you call commit() yourself. But if you pass CREATE on an empty index, IW suddenly calls commit()? That's just an inconsistency that's meant to allow you to open an IR immediately after the new IW() call, regardless of what was there? And if you open that IR, then if the index was populated you see the previous set of documents, but if it wasn't you see nothing, even though you meant to say "override what's there"? I've checked what FileOutputStream does, using the following code:
{code}
File file = new File("d:/temp/tmpfile");
FileOutputStream fos = new FileOutputStream(file);
fos.write(3);
fos.close();
fos = new FileOutputStream(file);
FileInputStream fis = new FileInputStream(file);
System.out.println(fis.read());
{code}
* Second line creates an empty file immediately, not waiting for close() or flush() -- which resembles the behavior you're suggesting we should take w/ IW (which is today's behavior).
* Fourth line closes the file, flushing and writing the content.
* Fifth line *recreates* the file, empty, again, w/o calling close. So it zeros out the file content immediately, even before you've written a single byte to it.
* Sixth+seventh lines prove it by attempting to read from the file; the output printed is -1.
I've wrapped the FOS w/ a BufferedOS and the behavior is still the same. What I'm trying to show is that we don't fully adhere to the CREATE mode -- and rightfully so, if you ask me: we shouldn't zero out the segments until the application has called commit(). But we choose to adhere differently to the CREATE* mode if the index is already populated. That's inconsistent behavior, at least from my perspective. It's also harder to explain and document, e.g. "you should call commit() if you used CREATE, in case you want to zero out everything immediately, and the Directory is not empty, but you don't need to call commit() if the directory was empty, Lucene will do it for you" -- so now how will the app know if it should call commit()? It will need to write a sort of emptiness-detection-then-commit? I am willing to consider the following semantics: * APPEND - assumes an index exists and opens it. * CREATE - zeros out everything that's in the directory *immediately*, and also prepares an empty directory. * CREATE_OR_APPEND - either loads an existing index, or is able to work on the empty directory. No implicit commit happens in IW if the index does not exist. But I think CREATE is too dangerous, and so I prefer to stick w/ the change proposed in the patch so far -- if you open an index in CREATE*, you should call commit() before you can read it. That will adhere to the semantics of what the application wanted, whether it meant to zero out an existing Directory or create a new one from scratch. 
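For what it's worth, the FileOutputStream snippet above can be turned into a self-contained check (hypothetical class name; a temp file is used here instead of the hard-coded d:/temp path):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Self-contained check of the truncate-on-open behavior described above.
public class TruncateOnOpenDemo {
    // Returns the first byte read after the file has been re-opened for
    // writing: -1 means the old content was zeroed out immediately,
    // before anything new was written.
    public static int readAfterReopen() throws IOException {
        File file = File.createTempFile("truncate", ".tmp");
        FileOutputStream fos = new FileOutputStream(file);
        fos.write(3);
        fos.close();
        fos = new FileOutputStream(file); // recreates the file, empty, w/o any write
        FileInputStream fis = new FileInputStream(file);
        int first = fis.read();
        fis.close();
        fos.close();
        file.delete();
        return first;
    }
}
```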
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856063#action_12856063 ] Shai Erera commented on LUCENE-2386: So just call new IW(), then rollback and ensure dir.listAll() returns an empty list? Or also index stuff, making sure a flush occurs, and then rollback? I'm not sure the latter is related to this issue ...
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch includes the proposed test in TestIndexWriter. I think this is ready for commit, if there are no more objections.
[jira] Resolved: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2386. Lucene Fields: [New, Patch Available] (was: [New]) Resolution: Fixed Committed revision 932868.
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855713#action_12855713 ] Shai Erera commented on LUCENE-1709: Committed revision 932878 with the following: # benchmark tests force sequential run # threadsPerProcessor defaults to 1 and can be overridden by -DthreadsPerProcessor=value # A CHANGES entry Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say "use N threads" and it'd do the right thing... like the -j flag to make. {quote}
Re: svn commit: r932873 - /lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java
Sorry about that ... On Sun, Apr 11, 2010 at 3:10 PM, uschind...@apache.org wrote: Author: uschindler Date: Sun Apr 11 12:10:57 2010 New Revision: 932873 URL: http://svn.apache.org/viewvc?rev=932873view=rev Log: add missing license header Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java?rev=932873r1=932872r2=932873view=diff == --- lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java (original) +++ lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java Sun Apr 11 12:10:57 2010 @@ -1,5 +1,22 @@ package org.apache.lucene.index; +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + import java.io.FileNotFoundException; /**
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855727#action_12855727 ] Shai Erera commented on LUCENE-2386: Committed revision 932917 for the revert.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Fixes IndexFileDeleter, adds a proper test to TestIndexWriter. Haven't run all the tests yet though, but the added test passes now with the fix.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855767#action_12855767 ] Shai Erera commented on LUCENE-2386: About IndexReader.listCommits ... the javadocs state: "There must be at least one commit in the Directory, else this method throws java.io.IOException." So I'll change it to reflect that the right exception type is thrown (IndexNotFoundException) and revert the change to DirReader.listCommits which returns an empty list.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch w/ proposed fixes. All tests pass, including Solr's :).
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch updated to latest rev. + the proposed name change -- IndexNotFoundException. All tests pass. I plan to commit this later today.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855344#action_12855344 ] Shai Erera commented on LUCENE-2386: Ok I've added the following to DirReader:

{code}
try {
  latest.read(dir, codecs);
} catch (FileNotFoundException e) {
  if (e.getMessage().startsWith("no segments* file found in")) {
    // Might be that the Directory is empty, in which case just return an
    // empty collection.
    return Collections.emptyList();
  } else {
    throw e;
  }
}
{code}

And now that test passes. I'll continue discovering tests that fail ... probably backwards will have its share too :).
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855369#action_12855369 ] Shai Erera commented on LUCENE-2386: I already did that ... just didn't post back. Created SegmentsFileNotFoundException.
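The dedicated exception discussed here replaces the fragile message-prefix check above with a catch-by-type. A minimal self-contained sketch (the class was later renamed and committed as IndexNotFoundException, as seen elsewhere in this thread; the demo class around it is purely illustrative):

```java
import java.io.FileNotFoundException;

// Sketch: a dedicated exception for the "no segments* file found" case, so
// callers can catch it by type instead of parsing the message string.
// Extending FileNotFoundException keeps existing catch clauses working.
class IndexNotFoundException extends FileNotFoundException {
    public IndexNotFoundException(String msg) {
        super(msg);
    }
}

public class Main {
    public static void main(String[] args) {
        try {
            throw new IndexNotFoundException("no segments* file found in directory");
        } catch (IndexNotFoundException e) {
            // An empty directory can now be detected without inspecting the message.
            System.out.println("empty index detected: " + e.getMessage());
        }
    }
}
```

Since the new class subclasses FileNotFoundException, code that previously caught FNFE is unaffected by the change.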
[jira] Commented: (LUCENE-1879) Parallel incremental indexing
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855379#action_12855379 ] Shai Erera commented on LUCENE-1879: I have found such a version ... and it fails too :). At least the one I received. But never mind that ... as long as we both agree the implementation should change. I didn't mean to say anything bad about what you did .. I know the limitations you had to work with. Parallel incremental indexing - Key: LUCENE-1879 URL: https://issues.apache.org/jira/browse/LUCENE-1879 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael Busch Assignee: Michael Busch Fix For: 3.1 Attachments: parallel_incremental_indexing.tar A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler. Find details on the wiki page for this feature: http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing Discussion on java-dev: http://markmail.org/thread/ql3oxzkob7aqf3jd -- This message is automatically generated by JIRA.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch fixes all tests as well as changes to IndexWriter, IndexFileDeleter, DirectoryReader and SegmentInfos. I'd like to commit this shortly, before all the files get changed by a malicious other commit :). (kidding of course)
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855457#action_12855457 ] Shai Erera commented on LUCENE-2386: Ok sounds good. Is there a preferred package for exceptions? Or is o.a.l.index ok?
Move NoDeletionPolicy to core
Hi, I've noticed benchmark has a NoDeletionPolicy class and I was wondering if we can move it to core. I might want to use it for the parallel index stuff, but I think it'll also fit nicely in core, together with the other No* classes. In addition, this class should be made a singleton. If moving to core is acceptable, do you think any bw policy needs to be enforced (such as deprecating the one in benchmark and referencing the one in core)? I'll also want to change the package name from o.a.l.benchmark.utils to o.a.l.index, where the other IDPs are. Simple move and change (and update to the benchmark algs which use it). Shai
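The singleton shape proposed above is easy to picture. A self-contained sketch follows; the real class implements Lucene's IndexDeletionPolicy, so the tiny local interface here is only a stand-in that lets the sketch compile on its own:

```java
import java.util.List;

// Stand-in for Lucene's IndexDeletionPolicy so the sketch is self-contained.
interface DeletionPolicy {
    void onInit(List<?> commits);
    void onCommit(List<?> commits);
}

// Keeps every commit by simply never deleting anything; made a singleton,
// in the spirit of the other No* classes (e.g. NoMergeScheduler).
final class NoDeletionPolicy implements DeletionPolicy {
    public static final DeletionPolicy INSTANCE = new NoDeletionPolicy();

    private NoDeletionPolicy() {} // no instances besides INSTANCE

    public void onInit(List<?> commits) {}   // keep all commits found on init
    public void onCommit(List<?> commits) {} // keep every new commit
}

public class Main {
    public static void main(String[] args) {
        // All users share the one stateless instance.
        System.out.println(NoDeletionPolicy.INSTANCE == NoDeletionPolicy.INSTANCE);
    }
}
```

Because the policy is stateless, a private constructor plus a shared INSTANCE is all the singleton machinery it needs.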
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854885#action_12854885 ] Shai Erera commented on LUCENE-2074: Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems. Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer --- Key: LUCENE-2074 URL: https://issues.apache.org/jira/browse/LUCENE-2074 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Java 3.0 we switch to Java 1.5, so we should regenerate the file. After regeneration the Tokenizer behaves different for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
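The "huge buffer allocation" guard mentioned above amounts to capping how far a scanner buffer may grow on a pathological token. A self-contained sketch of the idea, with made-up names (the real StandardTokenizer/JFlex internals differ):

```java
// Illustrative only: a buffer that doubles without bound can allocate huge
// arrays when fed one enormous "token". The guard is to cap the buffer
// (i.e. the maximum token length) and drop characters beyond the cap.
final class BoundedBuffer {
    private static final int MAX_TOKEN_LEN = 255; // cap instead of growing forever
    private char[] buf = new char[16];
    private int len = 0;

    void append(char c) {
        if (len == buf.length) {
            if (len >= MAX_TOKEN_LEN) return;         // drop chars beyond the cap
            int next = Math.min(buf.length * 2, MAX_TOKEN_LEN);
            char[] bigger = new char[next];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;                              // bounded doubling
        }
        buf[len++] = c;
    }

    int length() { return len; }
}

public class Main {
    public static void main(String[] args) {
        BoundedBuffer b = new BoundedBuffer();
        for (int i = 0; i < 1_000_000; i++) b.append('a'); // pathological input
        System.out.println(b.length()); // capped at 255, not ~1 MB of chars
    }
}
```

This is why the fix can be "quick and tiny": the cap is a local change to the growth path, independent of the Unicode work the issue is really about.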
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854887#action_12854887 ] Shai Erera commented on LUCENE-2074: bq. I plan to commit this soon! That's great news ! BTW - what are you going to do w/ the JFlex 1.5 binary? Are you going to check it in somewhere? because it hasn't been released last I checked. I'm asking for general knowledge, because I know the scripts are downloading it, or rely on it to exist somewhere. In that case, then yes, let's fix it here.
[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)
[ https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854920#action_12854920 ] Shai Erera commented on LUCENE-1482: I still think that calling isDebugEnabled is better, because the message formatting stuff may do unnecessary things like casting, autoboxing etc. IMO, if logging is enabled, evaluating it twice is not a big deal ... it's a simple check. I'm glad someone here thinks logging will be useful though :). I wish there will be quorum here to proceed w/ that. Note that I also offered to not create any dependency on SLF4J, but rather extract infoStream to a static InfoStream class, which will avoid passing it around everywhere, and give the flexibility to output stuff from other classes which don't have an infoStream at hand. Replace infoSteram by a logging framework (SLF4J) - Key: LUCENE-1482 URL: https://issues.apache.org/jira/browse/LUCENE-1482 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Fix For: 3.1 Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar Lucene makes use of infoStream to output messages in its indexing code only. For debugging purposes, when the search application is run on the customer side, getting messages from other code flows, like search, query parsing, analysis etc can be extremely useful. There are two main problems with infoStream today: 1. It is owned by IndexWriter, so if I want to add logging capabilities to other classes I need to either expose an API or propagate infoStream to all classes (see for example DocumentsWriter, which receives its infoStream instance from IndexWriter). 2. I can either turn debugging on or off, for the entire code. Introducing a logging framework can allow each class to control its logging independently, and more importantly, allows the application to turn on logging for only specific areas in the code (i.e., org.apache.lucene.index.*). 
I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, as its name states, a facade over different logging frameworks. As such, you can include the slf4j.jar in your application, and it recognizes at deploy time what is the actual logging framework you'd like to use. SLF4J comes with several adapters for Java logging, Log4j and others. If you know your application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in your classpath, and your logging statements will use Java logging underneath the covers. This makes the logging code very simple. For a class A the logger will be instantiated like this:

{code}
public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);
}
{code}

And will later be used like this:

{code}
public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);

  public void foo() {
    if (logger.isDebugEnabled()) {
      logger.debug("message");
    }
  }
}
{code}

That's all! Checking for isDebugEnabled is very quick, at least using the JDK14 adapter (but I assume it's fast also over other logging frameworks). The important thing is, every class controls its own logger. Not all classes have to output logging messages, and we can improve Lucene's logging gradually, w/o changing the API, by adding more logging messages to interesting classes. I will submit a patch shortly. -- This message is automatically generated by JIRA.
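The isDebugEnabled() guard discussed here can be demonstrated without any extra jars using java.util.logging, which is the framework the slf4j-jdk14 adapter delegates to; Logger.isLoggable plays the role of isDebugEnabled in this sketch:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// The guard pattern from the discussion, shown with JDK logging so it runs
// standalone. The point: message construction (concatenation, boxing, etc.)
// is skipped entirely when the level is disabled.
public class Main {
    private static final Logger logger = Logger.getLogger(Main.class.getName());

    static String expensiveState() {
        return "state"; // stands in for costly formatting work
    }

    public static void main(String[] args) {
        logger.setLevel(Level.INFO); // FINE (i.e. debug) is disabled

        // Cheap check; expensiveState() only runs when FINE is enabled.
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("current state: " + expensiveState());
        }

        System.out.println(logger.isLoggable(Level.FINE)); // false
    }
}
```

The guard costs one level comparison, which is why evaluating it "twice" (once in the guard, once inside the log call) is negligible compared to building the message eagerly.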
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855020#action_12855020 ] Shai Erera commented on LUCENE-1709: Robert, I will commit the patch, seems good to do anyway. We can handle the ant jars separately later. And this hang behavior is exactly what I experience, including the FileInputStream thing. Only on my machine, when I took a thread dump, it showed that Ant waits on FIS.read() ... Robert - to remind you that even with the patch which forces junit to use a separate temp folder per thread, it still hung ... Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make. {quote} -- This message is automatically generated by JIRA.
[jira] Created: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
Move NoDeletionPolicy from benchmark to core Key: LUCENE-2385 URL: https://issues.apache.org/jira/browse/LUCENE-2385 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark, Index Reporter: Shai Erera Assignee: Shai Erera Priority: Trivial Fix For: 3.1 As the subject says, but I'll also make it a singleton + add some unit tests, as well as some documentation. I'll post a patch hopefully today. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
IndexWriter commits unnecessarily on fresh Directory Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessarily, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) back :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2385: --- Attachment: LUCENE-2385.patch Move NoDeletionPolicy to core, adds javadocs + TestNoDeletionPolicy. Also includes the relevant changes to benchmark (algorithms + CreateIndexTask). I've fixed a typo I had in NoMergeScheduler - not related to this issue, but since it was just a typo, thought it's no harm to do it here. Tests pass. Planning to commit shortly.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855131#action_12855131 ] Shai Erera commented on LUCENE-2386: Took a look at IndexFileDeleter, and located to offending code segment which is responsible for the IndexCorruptException: {code} if (currentCommitPoint == null) { // We did not in fact see the segments_N file // corresponding to the segmentInfos that was passed // in. Yet, it must exist, because our caller holds // the write lock. This can happen when the directory // listing was stale (eg when index accessed via NFS // client with stale directory listing cache). So we // try now to explicitly open this commit point: SegmentInfos sis = new SegmentInfos(); try { sis.read(directory, segmentInfos.getCurrentSegmentFileName(), codecs); } catch (IOException e) { throw new CorruptIndexException(failed to locate current segments_N file); } {code} Looks like this code protects against a real problem, which was raised on the list a couple of times already - stale NFS cache. So I'm reluctant to remove that check ... thought I still think we should differentiate between a newly created index on a fresh Directory, to a stale NFS problem. Maybe we can pass a boolean isNew or something like that to the ctor, and if it's a new index and the last commit point is missing, IFD will not throw the exception, but silently ignore that? So the code would become something like this: {code} if (currentCommitPoint == null !isNew) { } {code} Does this make sense, or am I missing something? IndexWriter commits unnecessarily on fresh Directory Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. 
This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) back :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855140#action_12855140 ] Shai Erera commented on LUCENE-2385: I did that first, but then remembered that when I did that in the past, people were unable to apply my patches, w/o doing the svn move themselves. Anyway, for this file it's not really important I think - a very simple and tiny file, w/ no history to preserve? Is that ok for this file (b/c I have no idea how to do the svn move now ... after I've made all the changes already) :) Move NoDeletionPolicy from benchmark to core Key: LUCENE-2385 URL: https://issues.apache.org/jira/browse/LUCENE-2385 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark, Index Reporter: Shai Erera Assignee: Shai Erera Priority: Trivial Fix For: 3.1 Attachments: LUCENE-2385.patch As the subject says, but I'll also make it a singleton + add some unit tests, as well as some documentation. I'll post a patch hopefully today. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855148#action_12855148 ] Shai Erera commented on LUCENE-2386: Looking at IFD again, I think a boolean ctor arg is not required. What I can do is check if any Lucene file has been seen (in the for-loop iteration on the Directory files), and if not, then deduce it's a new Directory, and skip that 'if' check. I'll give it a shot.
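The detection idea above - deduce a fresh Directory when no Lucene index file was seen while scanning the directory listing - can be sketched in isolation. The class and the file-name pattern below are hypothetical simplifications for illustration, not IndexFileDeleter's actual code (Lucene's real IndexFileNameFilter is more thorough):

```java
import java.util.regex.Pattern;

public class FreshDirectoryCheck {
    // Simplified pattern for Lucene index file names such as
    // segments_2, segments.gen, _0.cfs, _1.fdt (illustration only).
    private static final Pattern LUCENE_FILE =
        Pattern.compile("segments(_[0-9a-z]+)?|segments\\.gen|_[0-9a-z]+\\..+");

    // Returns true if none of the directory's files look like Lucene index
    // files, i.e. the Directory is fresh and the missing-commit check in
    // IndexFileDeleter could be skipped.
    static boolean isFreshDirectory(String[] files) {
        for (String f : files) {
            if (LUCENE_FILE.matcher(f).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isFreshDirectory(new String[0]));               // true
        System.out.println(isFreshDirectory(new String[] {"foo.txt"}));    // true
        System.out.println(isFreshDirectory(new String[] {"segments_1"})); // false
    }
}
```

The appeal over a boolean ctor arg is that the deleter already iterates the listing anyway, so no API change is needed.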
[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2385: --- Attachment: LUCENE-2385.patch Is it better now?
[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855155#action_12855155 ] Shai Erera commented on LUCENE-2385: Forgot to mention that the only move I made was of NoDeletionPolicy: svn move contrib/benchmark/src/java/org/apache/lucene/benchmark/utils/NoDeletionPolicy.java src/java/org/apache/lucene/index/NoDeletionPolicy.java I'll remember that in the future Uwe - thanks for the heads up !
[jira] Resolved: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2385. Resolution: Fixed Committed revision 932129.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch First stab at this. Patch still missing CHANGES entry, and I haven't run all the tests, just TestIndexWriter. With those changes it passes. One thing that I think should be fixed is testImmediateDiskFull - if I don't add writer.commit(), the test fails, because dir.getRecomputeActualSizeInBytes returns 0 (no RAMFiles yet), and then the test succeeds at adding one document. So maybe just change the test to set maxSizeInBytes to '1', always? TestNoDeletionPolicy is not covered by this patch (should be fixed as well, because now the number of commits is exactly N and not N+1). Will fix it tomorrow. Anyway, it's really late now, so hopefully some fresh eyes will look at it while I'm away, and comment on the proposed changes. I hope I got all the changes to the tests right.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855265#action_12855265 ] Shai Erera commented on LUCENE-2386: bq. Maybe change testImmediateDiskFull to set max allowed size to max(1, current-usage)? Good idea ! Did it and it works. Now ... one thing I haven't mentioned is the bw break. This is a behavioral bw break, which specifically I'm not so sure we should care about, because I wonder how many apps out there rely on being able to open a reader before they ever committed on a fresh new index. So what do you think - do this change anyway, OR ... utilize Version to our aid? I.e., if the Version that was passed to IWC is before LUCENE_31, we keep the initial commit, otherwise we don't do it? Pros is that I won't need to change many of the tests because they still use the LUCENE_30 version (but that is not a strong argument), so it's a weak Pro. Cons is that IW will keep having that doCommit handling in its ctor, only now w/ added comments on why this is being kept around etc. What do you think?
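If the Version-gated route were taken, the gate itself would be tiny - roughly the sketch below. The enum here is a hypothetical stand-in for org.apache.lucene.util.Version (which exposes the same onOrAfter-style check); the method name is made up for illustration:

```java
public class VersionGateSketch {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum Version {
        LUCENE_30, LUCENE_31;
        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    // Keep the old initial-commit-on-fresh-Directory behavior only for
    // configs created against pre-3.1 versions.
    static boolean shouldDoInitialCommit(Version matchVersion) {
        return !matchVersion.onOrAfter(Version.LUCENE_31);
    }

    public static void main(String[] args) {
        System.out.println(shouldDoInitialCommit(Version.LUCENE_30)); // true
        System.out.println(shouldDoInitialCommit(Version.LUCENE_31)); // false
    }
}
```

The cost, as noted, is that the doCommit handling survives in the IW ctor indefinitely, guarded by a version check.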
TestCodecs running time
Hi I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35-40 seconds. Is that expected? The reason it runs so long seems to be that its threads each make 4000 iterations ... is that really required to ensure correctness? Shai
Re: Controlling the maximum size of a segment during indexing
I'm not sure .. but did you set the RAMBufferSizeMB on IWC? Doesn't look like it, and the default is 16 MB, which can explain why it doesn't flush before that. Shai On Fri, Apr 9, 2010 at 8:01 AM, Lance Norskog goks...@gmail.com wrote: Here is a Java unit test that uses the LogByteSizeMergePolicy to control the maximum size of segment files during indexing. That is, it tries. It does not succeed. Will someone who truly understands the merge policy code please examine it? There is probably one tiny parameter missing. It adds 20 documents that each are 100k in size. It creates an index in a RAMDirectory which should have one segment that's a tad over 1mb, and then a set of segments that are a tad over 500k. Instead, the data does not flush until it commits, writing one 5m segment. - org.apache.lucene.index.TestIndexWriterMergeMB --- package org.apache.lucene.index; /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
*/ import java.io.IOException; import org.apache.lucene.analysis.WhitespaceAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.FieldSelectorResult; import org.apache.lucene.document.Field.Index; import org.apache.lucene.store.Directory; import org.apache.lucene.store.RAMDirectory; import org.apache.lucene.util.LuceneTestCase; /* * Verify that segment sizes are limited to # of bytes. * * Sizing: * Max MB is 0.5m. Verify against this plus 100k slop. (1.2x) * Min MB is 10k. * Each document is 100k. * mergeSegments=2 * MaxRAMBuffer=1m. Verify against this plus 200k slop. (1.2x) * * This test should cause the ram buffer to flush after 10 documents, and create a CFS a little over 1meg. * The later documents should be flushed to disk every 5-6 documents, and create CFS files a little over 0.5meg. */ public class TestIndexWriterMergeMB extends LuceneTestCase { private static final int MERGE_FACTOR = 2; private static final double RAMBUFFER_MB = 1.0; static final double MIN_MB = 0.01d; static final double MAX_MB = 0.5d; static final double SLOP_FACTOR = 1.2d; static final double MB = 1000*1000; static String VALUE_100k = null; // Test controlling the mergePolicy for max # of docs public void testMaxMergeMB() throws IOException { Directory dir = new RAMDirectory(); IndexWriterConfig config = new IndexWriterConfig( TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT)); LogByteSizeMergePolicy mergeMB = new LogByteSizeMergePolicy(); config.setMergePolicy(mergeMB); mergeMB.setMinMergeMB(MIN_MB); mergeMB.setMaxMergeMB(MAX_MB); mergeMB.setUseCompoundFile(true); mergeMB.setMergeFactor(MERGE_FACTOR); config.setMaxBufferedDocs(100); // irrelevant but the next line fails without this. 
config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH); MergeScheduler scheduler = new SerialMergeScheduler(); config.setMergeScheduler(scheduler); IndexWriter writer = new IndexWriter(dir, config); System.out.println("Start indexing"); for (int i = 0; i < 50; i++) { addDoc(writer, i); printSegmentSizes(dir); } checkSegmentSizes(dir); System.out.println("Commit"); writer.commit(); printSegmentSizes(dir); checkSegmentSizes(dir); writer.close(); } // document that takes 100k of RAM private void addDoc(IndexWriter writer, int i) throws IOException { if (VALUE_100k == null) { StringBuilder value = new StringBuilder(10); for (int fill = 0; fill < 10; fill++) { value.append('a'); } VALUE_100k = value.toString(); } Document doc = new Document(); doc.add(new Field("id", i + "", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("content", VALUE_100k, Field.Store.YES, Field.Index.NOT_ANALYZED)); writer.addDocument(doc); } private void checkSegmentSizes(Directory dir) { try { String[] files = dir.listAll(); for (String file : files) { if
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855277#action_12855277 ] Shai Erera commented on LUCENE-2386: Apparently, there are more tests that fail ... lost count but easy fixing. I tried writing the following test: {code} public void testNoCommits() throws Exception { // Tests that if we don't call commit(), the directory has 0 commits. This has // changed since LUCENE-2386, where before IW would always commit on a fresh // new index. Directory dir = new RAMDirectory(); IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT))); assertEquals("expected 0 commits!", 0, IndexReader.listCommits(dir).size()); // Closing with no changes should still generate a commit, because it's a new index. writer.close(); assertEquals("expected 1 commits!", 1, IndexReader.listCommits(dir).size()); } {code} Simple test - validates that no commits are present following a freshly new index creation, w/o closing or committing. However, IndexReader.listCommits fails w/ the following exception: {code} java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.ramdirect...@2d262d26: files: [] at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:652) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:535) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:323) at org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1033) at org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1023) at org.apache.lucene.index.IndexReader.listCommits(IndexReader.java:1341) at org.apache.lucene.index.TestIndexWriter.testNoCommits(TestIndexWriter.java:4966) {code} The failure occurs when SegmentInfos attempts to find segments.gen and fails. 
So I wonder if I should fix DirectoryReader to catch that exception and simply return an empty Collection .. or I should fix SegmentInfos at this point -- notice the "files: []" at the end - I think that by adding a check to the following code (SegmentInfos, line 652) which validates that there were any files before throwing the exception, it'll still work properly and safely (i.e. to detect a problematic Directory). Will probably need to break away from the while loop and I guess fix some other things in upper layers ... therefore I'm not sure if I should not simply catch that exception in DirectoryReader.listCommits w/ proper documentation and be done w/ it. After all, it's not supposed to be called ... ever? or hardly ever? {code} if (gen == -1) { // Neither approach found a generation throw new FileNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files)); } {code}
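The catch-and-return-empty alternative for listCommits can be sketched in isolation. Here readCommits is a hypothetical stand-in for the SegmentInfos machinery that throws the FileNotFoundException shown in the stack trace above; it is not Lucene's actual API:

```java
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ListCommitsSketch {
    // Hypothetical loader standing in for SegmentInfos.read(): throws when
    // no segments_N file exists, as on a freshly created, never-committed index.
    static List<String> readCommits(String[] files) throws FileNotFoundException {
        if (files.length == 0) {
            throw new FileNotFoundException("no segments* file found: files: []");
        }
        return Arrays.asList(files); // pretend each file is a commit point
    }

    // The alternative discussed above: catch the exception in listCommits
    // and return an empty collection instead of propagating it.
    static List<String> listCommits(String[] files) {
        try {
            return readCommits(files);
        } catch (FileNotFoundException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(listCommits(new String[0]));              // empty for a never-committed index
        System.out.println(listCommits(new String[] {"segments_1"}));
    }
}
```

Handling it at this level keeps SegmentInfos' stale-directory protection intact while making the rare no-commits case benign for callers.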
[jira] Updated: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1709: --- Attachment: LUCENE-1709-2.patch Since I had the changes on my local env. I thought it's best to generate a patch out of them, so they don't get lost. The patch doesn't cover the ant .jars, only the changes to common-build.xml as well as benchmark/build.xml Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make. {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2377. Resolution: Fixed Committed revision 931502. Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark - Key: LUCENE-2377 URL: https://issues.apache.org/jira/browse/LUCENE-2377 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 Attachments: LUCENE-2377.patch Benchmark allows one to set the MP and MS to use, by defining the class name and then use reflection to instantiate them. However NoMP and NoMS are singletons and therefore reflection does not work for them. Easy fix in CreateIndexTask. I'll post a patch soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854588#action_12854588 ] Shai Erera commented on LUCENE-2353: Actually, we've reopened LUCENE-1709 to track that. This is not related to this issue's changes, but seems to be related to the benchmark tests specifically. Please have a look there at a patch I've posted which forces benchmark tests to run in sequential mode. Additionally, you can 'ant test -Drunsequential=1' from the command line, benchmark's root folder, to achieve the same. And it'd be great if you post the above on LUCENE-1709 as well -- because now I know I'm not the only one running into this :). Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2353.patch, LUCENE-2353.patch I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code: {code} if (sval.indexOf(":") < 0) { return sval; } // first time this prop is extracted by round int k = sval.indexOf(":"); String colName = sval.substring(0, k); sval = sval.substring(k + 1); ... {code} It detects ':' in the value and so it thinks it's a per-round property, thus stripping 'd:' from the value ... fix is very simple: {code} if (sval.indexOf(":") < 0) { return sval; } else if (sval.indexOf(":\\") >= 0) { // this previously messed up absolute path names on Windows. Assuming // there is no real value that starts with \\ return sval; } // first time this prop is extracted by round int k = sval.indexOf(":"); String colName = sval.substring(0, k); sval = sval.substring(k + 1); {code} I'll post a patch w/ the above fix + test shortly.
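The bug is pure string handling and easy to reproduce outside benchmark. The sketch below is a stripped-down stand-in for Config.get (the method name and the omitted per-round bookkeeping are simplifications); it shows why a Windows drive letter used to be eaten and how the extra branch fixes it:

```java
public class ConfigColonSketch {
    // Stripped-down stand-in for benchmark's Config.get(String, String):
    // a value containing ':' is treated as a per-round list, and only the
    // part after the first ':' is kept for the current round.
    static String resolve(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval;              // plain value, no per-round syntax
        } else if (sval.indexOf(":\\") >= 0) {
            return sval;              // the fix: "d:\..." is a Windows path, not a round list
        }
        int k = sval.indexOf(":");    // without the fix, "d:\something" lands here
        return sval.substring(k + 1); // per-round property: drop the prefix before ':'
    }

    public static void main(String[] args) {
        System.out.println(resolve("d:\\something"));   // kept intact by the fix
        System.out.println(resolve("merge.factor:10")); // per-round value is extracted
    }
}
```

Without the `:\\` branch, `resolve("d:\\something")` would fall through to the substring and return `\something`, which is exactly the benchmark\work\something misresolution described above.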
Re: Getting fsync out of the loop
How often is fsync called? If it's just during calls to commit, then is that that expensive? I mean, how often do you call commit? If that's that expensive (do you have some numbers to share) then I think that'd be a neat idea. Though losing a few minutes worth of updates may sometimes be unrecoverable, depending on the scenario, but I guess for those cases the 'standard way' should be used. What if your background thread simply committed every couple of minutes? What's the difference between taking the snapshot (which means you had to call commit previously) and commit it, to call iw.commit by a background merge? Shai On Tue, Apr 6, 2010 at 5:11 PM, Earwin Burrfoot ear...@gmail.com wrote: So, I want to pump my IndexWriter hard and fast with documents. Removing fsync from FSDirectory helps. But for that I pay with possibility of index corruption, not only if my node suddenly loses power/kernelpanics, but also if it runs out of disk space (which happens more frequently). I invented the following solution: We write a special deletion policy that resembles SnapshotDeletionPolicy. At all times it takes hold of current synced commit and preserves it. Once every N minutes a special thread takes latest commit, syncs it and nominates as current synced commit. The previous one gets deleted. Now we are disaster-proof, and do fsync asynchronously from indexing threads. We pay for this with somewhat bigger transient disc usage, and probably losing a few minutes worth of updates in case of a crash, but that's acceptable. How does this sound? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Getting fsync out of the loop
Earwin - do you have some numbers to share on the running time of the indexing application? You've mentioned that if you take out fsync into a BG thread, the running time improves, but I'm curious to know by how much. Shai On Wed, Apr 7, 2010 at 2:26 AM, Earwin Burrfoot ear...@gmail.com wrote: Running out of disk space with fsync disabled won't lead to corruption. Even kill -9 the JRE process with fsync disabled won't corrupt. In these cases index just falls back to last successful commit. It's only power loss / OS / machine crash where you need fsync to avoid possible corruption (corruption may not even occur w/o fsync if you get lucky). Sorry to disappoint you, but running out of disk space is worse than kill -9. You can write down the file (to cache in fact), close it, all without getting any exceptions. And then it won't get flushed to disk because the disk is full. This can happen to segments file (and old one is deleted with default deletion policy). This can happen to fat freq/prox files mentioned in segments file (and yeah, the old segments file is deleted, so no falling back). What if your background thread simply committed every couple of minutes? What's the difference between taking the snapshot (which means you had to call commit previously) and commit it, to call iw.commit by a backgroud merge? -- But: why do you need to commit so often? To see stuff on reopen? Yes, I know about NRT. You've reinvented autocommit=true! ?? I'm doing regular commits, syncing down every Nth of it. Doesn't this just BG the syncing? Ie you could make a dedicated thread to do this. Yes, exactly, this BGs the syncing to a dedicated thread. Threads doing indexation/merging can continue unhampered. One possible win with this aproach is the cost of fsync should go way down the longer you wait after writing bytes to the file and before calling fsync. 
This is because typically OS write caches expire by time (eg 30 seconds), so if you wait long enough the bytes will already at least be delivered to the IO system (but the IO system can do further caching which could still take time). On windows at least I definitely noticed this effect -- wait some before fsync'ing and it's net/net much less costly. Yup. In fact you can just hold on to the latest commit for N seconds, then switch to the new latest commit. OS will fsync everything for you. I'm just playing around with stupid idea. I'd like to have NRT look-alike without binding readers and writers. :) Right now it's probably best for me to save my time and cut over to current NRT. But. An important lesson was learnt - no fsyncing blows up your index on out-of-disk-space.
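The scheme discussed in this thread - indexing threads publish commits cheaply, while a dedicated thread periodically fsyncs the newest commit and releases the previous one - can be sketched with plain java.util.concurrent. Commit handles are just strings here; in Lucene terms they would be IndexCommits held alive by a SnapshotDeletionPolicy-like deletion policy, and the actual Directory.sync call is stubbed out:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class BackgroundSyncSketch {
    private final AtomicReference<String> latestCommit = new AtomicReference<>();
    private final AtomicReference<String> lastSyncedCommit = new AtomicReference<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Called by indexing threads; cheap, no fsync on this path.
    public void onCommit(String commitName) {
        latestCommit.set(commitName);
    }

    // One sync pass: fsync the newest commit (stubbed out here) and
    // nominate it as the current synced commit; the previously preserved
    // commit could then be deleted by the deletion policy.
    public void syncOnce() {
        String commit = latestCommit.get();
        if (commit != null && !commit.equals(lastSyncedCommit.get())) {
            // directory.sync(filesOf(commit)) would go here
            lastSyncedCommit.set(commit);
        }
    }

    public String lastSynced() {
        return lastSyncedCommit.get();
    }

    // Run the sync pass in the background every 'period' units.
    public void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::syncOnce, period, period, unit);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

The trade-offs are as described above: transient disk usage grows (two commits are held at once), and a crash can lose up to one sync period of updates, in exchange for taking fsync off the indexing path.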
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854348#action_12854348 ] Shai Erera commented on LUCENE-1709: One more thing - change benchmark tests to run sequentially (by adding the property). Robert, are you going to tackle that soon?
[jira] Created: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark - Key: LUCENE-2377 URL: https://issues.apache.org/jira/browse/LUCENE-2377 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 Benchmark allows one to set the MP and MS to use by defining the class name and then using reflection to instantiate them. However, NoMP and NoMS are singletons, and therefore reflection does not work for them. Easy fix in CreateIndexTask. I'll post a patch soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
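The singleton problem described in the issue is the kind of thing a constructor-or-singleton reflection fallback solves: try the public no-arg constructor, and if there is none, look for a public static INSTANCE field (NoMergeScheduler exposes exactly such a field). The helper and the two stand-in classes below are hypothetical, not the actual CreateIndexTask fix:

```java
import java.lang.reflect.Constructor;

// Hypothetical helper: instantiate via a public no-arg constructor when one
// exists, otherwise fall back to a public static INSTANCE field, which is
// how singleton classes such as NoMergeScheduler expose themselves.
class Singletons {
    static Object instantiate(Class<?> clazz) {
        try {
            Constructor<?> ctor = clazz.getConstructor(); // public no-arg ctor only
            return ctor.newInstance();
        } catch (NoSuchMethodException noCtor) {
            try {
                return clazz.getField("INSTANCE").get(null); // singleton fallback
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(
                    clazz + " has neither a public no-arg ctor nor an INSTANCE field", e);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + clazz, e);
        }
    }
}

// Stand-ins for a normal policy class and a singleton-style one.
class PlainPolicy {
    public PlainPolicy() { }
}

class SingletonPolicy {
    public static final SingletonPolicy INSTANCE = new SingletonPolicy();
    private SingletonPolicy() { }
}
```

Note that NoMergePolicy uses named constants rather than a single INSTANCE, so a real fix would need to resolve those by field name as well; this sketch only shows the basic shape.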
[jira] Updated: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2377: --- Attachment: LUCENE-2377.patch Patch includes both the fix to CreateIndexTask and relevant tests in CreateIndexTaskTest. I plan to commit later today if there are no objections.
Re: Parallel tests in Benchmark
Ok, let's do that (add runsequential to benchmark and all the rest). If I run into this elsewhere as well I will report it and we can talk then about trying to find a solution. If it's just benchmark then I think we'll be ok. Shai

On Thursday, April 1, 2010, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 1, 2010 at 12:03 AM, Shai Erera ser...@gmail.com wrote: [...] I think you got everything. I reopened the JIRA issue too (LUCENE-1709) and listed the things we can do for sure now, such as lowering threadsPerProcessor (and allowing someone to use a system property to override this) and fixing junit temp files to be in the temp directory. Additionally I would like to fix the ant library problem as mentioned there. It works great from the command-line, but we should improve this for IDE users, so they do not see a compile error. I am personally for the idea of adding the runsequential property to benchmark's build.xml, to force it to run serially. While I am unable to reproduce your problem, it does not surprise me, as I had a tough time trying to prevent benchmark
Re: Landing the flex branch
bq. Try a merge back: This would let flex appear as a single commit to trunk, so the history of trunk would be preserved. +1 for that - I think the history of trunk is important to preserve. And there is also a way to ask for flex's history, so everybody wins? Shai On Thursday, April 1, 2010, Uwe Schindler u...@thetaphi.de wrote: Hi, we should think about how to merge the changes to trunk. I can try this out during the weekend, to merge back the changes to trunk, but this can be very hard. So we have the following options: Try a merge back: This would let flex appear as a single commit to trunk, so the history of trunk would be preserved. If somebody wants to see the changes in the flex branch, he could ask for them (e.g. in TortoiseSVN there is a checkbox Include merged revisions). If this is not easy or fails, we can do the following: - Create a big diff between current trunk and flex (after flex is merged up to trunk). Attach this patch to an issue and let everybody review. After that we can apply the patch to trunk. This would result in the same behavior for trunk, no changes lost, but the individual changes in flex cannot be reviewed. - Delete current trunk and svn move the branch to trunk (after flex is merged up to trunk): This would make the history of flex the current history. The drawback: you lose the latest trunk changes since the split of flex. Instead you will only see the merge messages. Therefore we should see this only as a last resort. Comments? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, March 30, 2010 5:35 PM To: java-dev@lucene.apache.org Subject: Landing the flex branch I think the time has finally come! Pending one issue (LUCENE-2354 -- Uwe), I think flex is ready to land. I think the other issues with Fix Version = Flex Branch can be moved to 3.1 after we land. 
We still use the pre-flex APIs in a number of places... I think this is actually good (so we continue to test the back-compat emulation layer). With time we can cut them over. After flex, there are a number of fun things to explore. E.g., we need to make attributes work well with codecs indexing/searching (with Multi/DirReader, serialize/deserialize, etc.); we need a BytesRef + packed ints FieldCache StringIndex variant which should use much less RAM in certain cases; we should build a fast core PForDelta codec; more queries can cut over to operating directly on byte[] terms, etc. But these can all come with time... Thoughts/issues/objections? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Welcome Uwe Schindler to the Lucene PMC
Congratulations Uwe ! Shai On Thursday, April 1, 2010, Earwin Burrfoot ear...@gmail.com wrote: Generics SpecOps made it to the top and are gonna rule us from the shadows :) Congrats! On Thu, Apr 1, 2010 at 16:37, Robert Muir rcm...@gmail.com wrote: Congrats Uwe! On Thu, Apr 1, 2010 at 7:05 AM, Grant Ingersoll gsing...@apache.org wrote: I'm pleased to announce that the Lucene PMC has voted to add Uwe Schindler to the PMC. Uwe has been doing a lot of work in Lucene and Solr, including several of the last releases in Lucene. Please join me in extending congratulations to Uwe! -Grant Ingersoll PMC Chair -- Robert Muir rcm...@gmail.com -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851829#action_12851829 ] Shai Erera commented on LUCENE-2310: +1 for this simplification. Can we just name it Indexable, and omit Document from it? That way it's both shorter, and there is less chance for users to directly link it w/ Document. One thing I didn't understand, though, is what will happen to the ir/is.doc() method? Will those be deprecated in favor of some other class which receives an IR as a parameter and knows how to re-construct an Indexable (Document)? Reduce Fieldable, AbstractField and Field complexity Key: LUCENE-2310 URL: https://issues.apache.org/jira/browse/LUCENE-2310 Project: Lucene - Java Issue Type: Sub-task Components: Index Reporter: Chris Male Attachments: LUCENE-2310-Deprecate-AbstractField-CleanField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-DocumentGetFields-core.patch, LUCENE-2310-Deprecate-DocumentGetFields.patch, LUCENE-2310-Deprecate-DocumentGetFields.patch In order to move field-type-like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality than storing fields, most of which is being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possibly Fieldable), moving much of the functionality into Field and FieldType. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera reassigned LUCENE-2353: -- Assignee: Shai Erera Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2353.patch, LUCENE-2353.patch I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}

It detects ":" in the value and so it thinks it's a per-round property, thus stripping "d:" from the value ... the fix is very simple:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with "\\"
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}

I'll post a patch w/ the above fix + test shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
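The intent of the guard can be captured in a small self-contained sketch. This is not the real Config class: firstRoundValue is a made-up name, the per-round parsing is simplified to "strip the column name and take the first ':'-separated value", and the ":/" check reflects the follow-up patch that also accepts 'c:/temp'-style paths:

```java
// Simplified sketch of the per-round-vs-Windows-path disambiguation
// described in LUCENE-2353 (hypothetical helper, not Lucene's Config).
class ConfigSketch {
    static String firstRoundValue(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval;                 // plain value, no per-round syntax
        } else if (sval.indexOf(":\\") >= 0 || sval.indexOf(":/") >= 0) {
            return sval;                 // Windows absolute path, e.g. d:\something
        }
        // per-round property: "<colName>:v1:v2:..." -- drop the column
        // name and return the first round's value
        int k = sval.indexOf(":");
        String values = sval.substring(k + 1);
        int next = values.indexOf(":");
        return next < 0 ? values : values.substring(0, next);
    }
}
```

Before the guard, "d:\something" was parsed as column name "d" with value "\something", which is exactly the benchmark\work\something failure the report describes.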
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851836#action_12851836 ] Shai Erera commented on LUCENE-2353: Unless there are objections, I plan to commit this shortly.
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851842#action_12851842 ] Shai Erera commented on LUCENE-2310: Right Earwin - agreed. I'd like to summarize a brief discussion we had on IRC around that: the idea is not to provide another interface/class for search purposes, but rather expose the right API from IndexReader, even if it might be a bit low-level. An API like getIndexedFields(docId) and getStoredFields(docId), both optionally taking a FieldSelector, should allow the application to re-construct its Indexable however it wants. And IR/IS don't need to know anything about that. To complete the picture for current users, we can have a static reconstruct() on Document which takes IR, docId and FieldSelector ... BTW, I'm not even sure getIndexedFields can be efficiently supported today. Just listing it here for completeness.
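The application-side re-construction discussed in the comment above can be made concrete with a toy model. None of these types exist in Lucene - StoredDoc and the predicate-based selector are hypothetical stand-ins for "low-level per-field access plus a FieldSelector":

```java
import java.util.*;
import java.util.function.Predicate;

// Toy model: given per-field stored values (as a reader API might expose
// them), re-construct a filtered document, the way an application would
// rebuild its own Indexable from IndexReader primitives.
class StoredDoc {
    private final Map<String, String> fields;

    StoredDoc(Map<String, String> fields) {
        this.fields = fields;
    }

    // 'selector' plays the role of a FieldSelector: only accepted field
    // names make it into the re-constructed view.
    Map<String, String> reconstruct(Predicate<String> selector) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (selector.test(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

The point of the proposal is that IndexReader only needs to hand out fields; the shaping into a Document (or anything else) can live entirely on the application side, as this sketch does.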
[jira] Resolved: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2353. Resolution: Fixed Committed revision 929520.
Parallel tests in Benchmark
Hi

I'd like to summarize a discussion I had w/ Robert and Mike last night on IRC, about the parallelism of tasks in Benchmark:

For some reason, ever since parallel tasks were introduced, when I run 'ant test' from the contrib/benchmark folder (or the root), the tests just hang at some point, after WriteLineDocTaskTest finishes. What's very weird is that it seems I'm the only one experiencing this, and so for a long time I thought it was just a problem w/ my environment ... until yesterday, when I did a fresh checkout of trunk to a fresh folder and project, and the tests still got stuck. A thread dump does not show anything relevant to Lucene code, but rather to Ant. The main thread is waiting on org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on org/apache/tools/ant/taskdefs/Execute.waitFor, and two others on java/io/FileInputStream.read. But nothing is related to Lucene code directly. Also annoyingly, but conveniently for debugging the issue, it happens very consistently on my machine - sometimes the test passes, but 90% of the time it hangs. Running w/ -Drunsequential=1 consistently succeeds.

We've explored different ways to understand the cause of the problem, and came across several improvements and a workaround, but unfortunately not a definite resolution:

* As a last resort, we can add a runsequential property to benchmark's build.xml, which forces Benchmark tests to run sequentially. Since that's a tiny package which takes a few seconds to run anyway, and parallelism doesn't improve much (it actually runs slower, when it passes, on my machine: parallel=15 sec, seq=11 sec), this might be acceptable.

* Moving the junit temp files (such as that flag file) to the temp directory each test uses. This is actually a good thing to do anyway (thanks Robert for spotting that), because it avoids accidental commits of such files :), and doesn't clutter the main environment. We did that because when I hit Ctrl+C to stop one of the runs which hung, we received an FNFE saying a junit flag file is being accessed by another process (something like that), and thought this was related to the hangs I'm seeing. Anyway, this file is accessed by multiple JVMs concurrently, which seems bad.

* Explore the JUnit Formatter code under src/test, since it uses file locking. I've disabled locks (using NoLockFactory), however the test still hung.

* Change common-build.xml threadsPerProcessor to '1' instead of '2'. We think that might be a good thing to do anyway - if people run on machines with just one CPU, threading is not expected to help much, as opposed to running on multiple CPUs. But we don't want to enforce it on anyone, so we think to change the default to '1', but introduce a property 'threadsPerProcessor' which users will be able to set explicitly.

** Surprisingly, when I set it to '1' or '10' (I run on a dual-core Thinkpad W500), the test consistently passes - it just doesn't like the value '2'. At least it passed as long as I ran it; maybe a thread hang is lurking for me around the corner somewhere.

* We made sure the benchmark tests indeed read/write the test data files from/to unique directories. But like I said - there is no hang in Lucene code reported in the thread dump.

It was very late last night when we stopped, and my eyes were tired, so I didn't summarize it right away. Robert, I hope I've captured everything we did; if not, please add. Anyone's got any suggestions? It's unfortunate that I'm the only one running into this problem, because whatever the suggestions are, you'll probably need me to confirm them :). And I'm going away for 3 days (camping - no internet ... well, at least no laptop :)), so unless someone has a suggestion within the coming few hours, we can continue when I get back.

Shai
[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2353: --- Attachment: LUCENE-2353.patch Updated to also match 'c:/temp' like paths, which are also accepted on Windows.
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850644#action_12850644 ] Shai Erera commented on LUCENE-2353: I don't have an account yet, so I cannot commit this on my own. Any volunteers?
[jira] Created: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Fix For: 3.1 I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}

It detects ":" in the value and so it thinks it's a per-round property, thus stripping "d:" from the value ... the fix is very simple:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with "\\"
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}

I'll post a patch w/ the above fix + test shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2353: --- Attachment: LUCENE-2353.patch The fix is only relevant to get(String, String) and not to the other get(String, type) variants. The benchmark test passed, but after I svn up (to include the latest parallel test change) the test just sits idle (after finishing), waiting for something. If I run the tests in Eclipse they pass, so I'm guessing it's a problem w/ my env. or build.xml? I also tried 'ant clean test' from within benchmark, but it didn't help. I then tried 'ant clean' from root and 'ant test' from benchmark, but the test just keeps waiting on WriteLineDocTaskTest, on this line: [junit] config properties: [junit] directory = RAMDirectory [junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker [junit] line.file.out = D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line [junit] --- I think this can go in (if it passes on someone else's machine), while I figure out what's wrong in my env. separately. Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Fix For: 3.1 Attachments: LUCENE-2353.patch I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code:
{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}
It detects ':' in the value and so thinks it's a per-round property, thus stripping 'd:' from the value ... the fix is very simple:
{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with \\
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}
I'll post a patch w/ the above fix + test shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
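The colon-detection logic above can be sketched in isolation. This is a minimal sketch with the proposed `:\\` guard applied; `resolve` and `ConfigColonSketch` are hypothetical names, not part of the actual Config class, and the real method's per-round bookkeeping is omitted:

```java
// Minimal sketch of the per-round colon detection in Config.get(String, String),
// with the proposed ":\\" guard. Names here are hypothetical stand-ins.
public class ConfigColonSketch {
    static String resolve(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval; // no colon at all: a plain value
        } else if (sval.indexOf(":\\") >= 0) {
            // Windows absolute path such as d:\something -- keep it intact
            return sval;
        }
        // otherwise treat it as a per-round property "name:v1:v2:..."
        // and strip the leading column name, as the quoted code does
        int k = sval.indexOf(":");
        return sval.substring(k + 1);
    }

    public static void main(String[] args) {
        System.out.println(resolve("d:\\something"));        // kept intact
        System.out.println(resolve("merge.factor:10:100"));  // first segment stripped
    }
}
```

Without the guard, the first branch never fires for `d:\something` (it contains a colon), so the per-round path strips `d:` and leaves a relative path that gets resolved under benchmark\work.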
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850075#action_12850075 ] Shai Erera commented on LUCENE-2345: Earwin, w/o knowing too much about the details of your work, I wanted to comment on "get rid of init/reinit/moreinit methods, moving the code to constructors". I'm working now on Parallel Index, and one of the things I do is extend IW. Currently IW's ctor code performs the initialization, however I'm thinking of moving that code to an init method. The reason is to allow easy extensions of IW, such as LUCENE-2330. There I'm going to add a default ctor to IW, accompanied by an init method the extending class can call if needed. So what I'm trying to say is that init methods are not always bad, and sometimes ctors limit you. Perhaps it would make sense though in what you're trying to do ... Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch I would like the ability to subclass SegmentReader for numerous reasons: * to capture initialization/close events * attach custom objects to an instance of a segment reader (caches, statistics, so on and so forth) * override methods on segment reader as needed Currently this isn't really possible. I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:
{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return newSegmentReader(readOnly);
  }
}
{code}
It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc.). I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in future.
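To illustrate the kind of extension the proposal would enable, here is a hypothetical subclass that captures initialization events (one of the stated use cases) by counting the readers it hands out. SegmentReader and ReadOnlySegmentReader are empty stand-ins so the sketch compiles on its own; none of this reflects an actual Lucene API:

```java
// Stand-in reader classes -- NOT the real Lucene types.
class SegmentReader {}
class ReadOnlySegmentReader extends SegmentReader {}

// The factory shape from the proposal, reduced to the get() method.
class SegmentReaderFactory {
    public SegmentReader get(boolean readOnly) {
        return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
    }
}

// Hypothetical subclass: hooks reader creation to collect a statistic.
public class CountingSegmentReaderFactory extends SegmentReaderFactory {
    public int opened = 0;

    @Override
    public SegmentReader get(boolean readOnly) {
        opened++; // capture the initialization event
        return super.get(readOnly);
    }

    public static void main(String[] args) {
        CountingSegmentReaderFactory factory = new CountingSegmentReaderFactory();
        factory.get(true);
        factory.get(false);
        System.out.println(factory.opened); // 2
    }
}
```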
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850083#action_12850083 ] Shai Erera commented on LUCENE-2345: Thanks Uwe, I know that a ctor is the preferred way, and in the process of introducing IWC I deleted IW.init, which all ctors called, and pulled all its code into IW's ctor. I will make that init() on IW final. But sometimes putting code in init() is not bad, and it's used elsewhere in Lucene too (e.g. PQ, and up until recently IW). Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850086#action_12850086 ] Shai Erera commented on LUCENE-2215: Sure, let's wait for the patch and some perf. results. paging collector Key: LUCENE-2215 URL: https://issues.apache.org/jira/browse/LUCENE-2215 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4, 3.0 Reporter: Adam Heinz Assignee: Grant Ingersoll Priority: Minor Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)