[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-04-08 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854853#action_12854853
 ] 

Toke Eskildsen commented on LUCENE-2380:


Working on LUCENE-2369 I essentially had to re-implement the FieldCache because 
of the hardwiring of arrays. Switching to accessor methods seems like the right 
direction to go.

 Add FieldCache.getTermBytes, to load term data as byte[]
 

 Key: LUCENE-2380
 URL: https://issues.apache.org/jira/browse/LUCENE-2380
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1


 With flex, a term is now an opaque byte[] (typically, a UTF-8 encoded Unicode 
 string, but not necessarily), so we need to push this up the search stack. 
 FieldCache now has getStrings and getStringIndex; we need corresponding 
 methods to load terms as native byte[], since in general they may not be 
 representable as String.  This should be quite a bit more RAM efficient too, 
 for US-ASCII content, since each character would then use 1 byte, not 2.
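For illustration, an accessor-style byte[] API of the kind being discussed might look roughly like the sketch below. This is a hypothetical shape only; the names (TermBytes, getTerm) and signatures are assumptions, not the committed design:

{code}
// Hypothetical sketch only -- not the real FieldCache API.
public interface TermBytes {
  /** Number of documents covered by this cache entry. */
  int size();

  /**
   * Copies the term bytes for the given doc into target,
   * returning the number of bytes written.
   */
  int getTerm(int doc, byte[] target);
}
{code}

An accessor like this would let an implementation pack all terms into one shared byte[] with per-document offsets, instead of hardwiring a String[] the way getStrings does.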

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Getting fsync out of the loop

2010-04-08 Thread Michael McCandless
On Wed, Apr 7, 2010 at 3:27 PM, Earwin Burrfoot ear...@gmail.com wrote:
>> No, this doesn't make sense.  The OS detects a disk full on accepting
>> the write into the write cache, not [later] on flushing the write
>> cache to disk.  If the OS accepts the write, then disk is not full (ie
>> flushing the cache will succeed, unless some other not-disk-full
>> problem happens).
>>
>> Hmmm, at least, normally.  What OS/IO system were you on when you saw
>> corruption due to disk full when fsync is disabled?
>>
>> I'm still skeptical that disk full even with fsync disabled can lead
>> to corruption... I'd like to see some concrete proof :)
>
> Linux 2.6.30-1-amd64, ext3, simple scsi drive

Hm.  Linux should detect disk full on the initial write.

> I checked with our resident DB brainiac, he says such things are possible.
>
> Okay, I'm not 100% sure this is the cause of my corruptions. It just happened
> that when the index got corrupted, disk space was also used up - several
> times.
> I had that silent-fail-to-write theory and checked it up with some
> knowledgeable people. Even if they are right, I can be mistaken and the
> root cause is different.

OK... if you get a more concrete case where disk full causes
corruption when you disable fsync, please post details back.  From
what I understand this should never happen.

>> You're mixing up terminology a bit here -- you can't hold on to the
>> latest commit then switch to it.  A commit (as sent to the deletion
>> policy) means a *real* commit (ie, IW.commit or IW.close was called).
>> So I think your BG thread would simply be calling IW.commit every N
>> seconds?
>
> Under "hold on to" I meant - keep from being deleted, like SnapshotDP does.

But, IW doesn't let you hold on to checkpoints... only to commits.

Ie SnapshotDP will only see actual commit/close calls, not
intermediate checkpoints like a random segment merge completing, a
flush happening, etc.

Or... maybe you would in fact call commit frequently from the main
threads (but with fsync disabled), and then your DP holds onto these
"fake" commits, periodically picking one of them to do the real
fsync'ing?

>>> I'm just playing around with a stupid idea. I'd like to have an NRT
>>> look-alike without binding readers and writers. :)
>>
>> I see... well binding durability & visibility will always be costly.
>> This is why Lucene decouples them (by making NRT readers available).
>
> My experiments do the same, essentially.
>
> But after I understood that to perform deletions IW has to load term indexes
> anyway, I'm almost ready to give up and go for the intertwined IW/IR mess :)

Hey if you really think it's a mess, post a patch that cleans it up :)

>> BTW, if you know your OS/IO system always persists cached writes w/in
>> N seconds, a safe way to avoid fsync is to use a by-time expiring
>> deletion policy.  Ie, a commit stays alive as long as its age is less
>> than X... DP's unit test has such a policy.  But you better really
>> know for sure that the OS/IO system guarantees that :)
>
> Yeah. I thought of it, but it is even more shady :)

I agree.  And even if you know you're on Linux, and that your pdflush
flushes after X seconds, you still have the IO system to contend with.

Best to stick with fsync, commit only for safety as needed by the app,
and use NRT for fast visibility.
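For reference, a by-time expiring deletion policy of the kind mentioned above
might look roughly like this (a sketch against the 3.x IndexDeletionPolicy
API; the class name, the age threshold, and the use of the commit timestamp
are assumptions):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch: a commit stays alive as long as its age is less than maxAgeMillis,
// giving the OS time to flush cached writes before the commit is deleted.
public class ExpirationTimeDeletionPolicy implements IndexDeletionPolicy {
  private final long maxAgeMillis;

  public ExpirationTimeDeletionPolicy(long maxAgeMillis) {
    this.maxAgeMillis = maxAgeMillis;
  }

  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long now = System.currentTimeMillis();
    // Never delete the most recent commit, no matter how old it is.
    for (int i = 0; i < commits.size() - 1; i++) {
      IndexCommit commit = commits.get(i);
      if (now - commit.getTimestamp() > maxAgeMillis) {
        commit.delete();
      }
    }
  }
}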

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854876#action_12854876
 ] 

Michael McCandless commented on LUCENE-2376:


OK but I suspect the root cause is the same here -- your index seems to have a 
truly massive number of fields.  Can you post the CheckIndex output?

IW re-uses per-field objects internally, so that many docs with the same field 
can be indexed more efficiently.  However, when IW sweeps to free up RAM, if it 
notices an allocated field object hasn't been used recently, because that field 
name has not occurred in recently added docs, it frees up that memory and logs 
a "purge field" message.  So from this output I can see you have at least 43K unique 
field names.

If you have not disabled norms on these fields you'll certainly run out of 
memory.  Even if you disable norms, Lucene is in general not optimized for a 
tremendous number of unique fields and you'll likely hit other issues.
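As a point of reference, norms can be disabled per field at indexing time; in the 2.9 API that looks roughly like the sketch below (the method, field name, and value here are made up for illustration):

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NoNormsExample {
  // Norms cost one byte per indexed field (with norms enabled) for every
  // document in the index, so tens of thousands of unique field names
  // add up very quickly.
  static Document makeDoc(String fieldName, String value) {
    Document doc = new Document();
    doc.add(new Field(fieldName, value,
                      Field.Store.NO,
                      Field.Index.ANALYZED_NO_NORMS));
    return doc;
  }
}
{code}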


 java.lang.OutOfMemoryError:Java heap space
 --

 Key: LUCENE-2376
 URL: https://issues.apache.org/jira/browse/LUCENE-2376
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1
 Environment: Windows
Reporter: Shivender Devarakonda
 Attachments: InfoStreamOutput.txt


 I see an OutOfMemory error in our product and it is happening when we have 
 some data objects on which we built the index. I see the following 
 OutOfMemory error; this is happening after we call IndexWriter.optimize():
 4/06/10 02:03:42.160 PM PDT [ERROR] [Lucene Merge Thread #12]  In thread 
 Lucene Merge Thread #12 and the message is 
 org.apache.lucene.index.MergePolicy$MergeException: 
 java.lang.OutOfMemoryError: Java heap space
 4/06/10 02:03:42.207 PM PDT [VERBOSE] [Lucene Merge Thread #12] [Manager] 
 Uncaught Exception in thread Lucene Merge Thread #12
 org.apache.lucene.index.MergePolicy$MergeException: 
 java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at java.util.HashMap.resize(HashMap.java:462)
   at java.util.HashMap.addEntry(HashMap.java:755)
   at java.util.HashMap.put(HashMap.java:385)
   at org.apache.lucene.index.FieldInfos.addInternal(FieldInfos.java:256)
   at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:366)
   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
   at 
 org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:638)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:608)
   at 
 org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:686)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4979)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4614)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
 4/06/10 02:03:42.895 PM PDT [ERROR]  this writer hit an OutOfMemoryError; 
 cannot complete optimize

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Move NoDeletionPolicy to core

2010-04-08 Thread Shai Erera
Hi

I've noticed benchmark has a NoDeletionPolicy class and I was wondering if
we can move it to core. I might want to use it for the parallel index stuff,
but I think it'll also fit nicely in core, together with the other No*
classes. In addition, this class should be made a singleton.

If moving to core is acceptable, do you think any bw policy needs to be
enforced (such as deprecating the one in benchmark and referencing the one in
core)? I'll also want to change the package name from o.a.l.benchmark.utils
to o.a.l.index, where the other IDPs are.

Simple move and change (and update to the benchmark algs which use it); a sketch of the singleton follows below.
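For reference, the singleton version could be as small as this sketch (the
class and field names are just a guess at what the core version might look
like):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch: a deletion policy that never deletes any commit point.
public final class NoDeletionPolicy implements IndexDeletionPolicy {
  public static final IndexDeletionPolicy INSTANCE = new NoDeletionPolicy();

  private NoDeletionPolicy() {
    // singleton
  }

  public void onInit(List<? extends IndexCommit> commits) throws IOException {}

  public void onCommit(List<? extends IndexCommit> commits) throws IOException {}
}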

Shai


[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854882#action_12854882
 ] 

Uwe Schindler commented on LUCENE-2074:
---

As requested on the mailing list, I will look into resetting the zzBuffer on 
Tokenizer.reset(Reader).

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854885#action_12854885
 ] 

Shai Erera commented on LUCENE-2074:


Uwe, must this be coupled with that issue? This one has been waiting for a long 
time (why? for the JFlex 1.5 release?), and protecting against a huge buffer 
allocation can be a really quick and tiny fix. And this one also focuses on 
getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer 
size is not such a critical issue that we need to move fast on it ... so it's 
your call. I just thought they are two unrelated problems.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854886#action_12854886
 ] 

Uwe Schindler commented on LUCENE-2074:
---

I plan to commit this soon! So any patch will get outdated; that's why I want to 
fix this here. And as this patch removes direct access from the Tokenizer to 
the lexer (it is only accessible through an interface now), we have to 
change the jflex file to do it correctly.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854887#action_12854887
 ] 

Shai Erera commented on LUCENE-2074:


bq. I plan to commit this soon! 

That's great news!

BTW - what are you going to do w/ the JFlex 1.5 binary? Are you going to check 
it in somewhere? Because it hasn't been released, last I checked. I'm asking 
for general knowledge, because I know the scripts download it, or rely on it 
existing somewhere.

In that case, then yes, let's fix it here.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854890#action_12854890
 ] 

Uwe Schindler commented on LUCENE-2074:
---

You don't need the JFlex binaries in general, only if you regenerate the source 
files (using ant jflex). And it's easy to generate: check out and run mvn 
install.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Move NoDeletionPolicy to core

2010-04-08 Thread Michael McCandless
+1

I don't think bw needs to be kept -- contrib/benchmark is allowed to change.

Mike

On Thu, Apr 8, 2010 at 5:44 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 I've noticed benchmark has a NoDeletionPolicy class and I was wondering if
 we can move it to core. I might want to use it for the parallel index stuff,
 but I think it'll also fit nicely in core, together with the other No*
 classes. In addition, this class should be made a singleton.

 If moving to core is acceptable, do you think any bw policy needs to be
 enforced (such as deprecating the one in benchmark and referencing the one in
 core)? I'll also want to change the package name from o.a.l.benchmark.utils
 to o.a.l.index, where the other IDPs are.

 Simple move and change (and update to the benchmark algs which use it).

 Shai


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2074:
--

Attachment: LUCENE-2074.patch

Here's a new patch, with the zzBuffer reset-to-default implemented in a separate 
reset(Reader) method. As yyreset is generated as final, I had to change the 
name.

Before applying, run:

{noformat}
svn copy StandardTokenizerImpl.* to StandardTokenizerImplOrig.* 
svn move StandardTokenizerImpl.* to StandardTokenizerImpl31.* 
{noformat}

I will commit this in a day or two!

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2074:
--

Attachment: LUCENE-2074.patch

Also updated the error message about missing JFlex when calling ant jflex to 
regenerate the lexers. The message now contains instructions for downloading 
and building JFlex. Also added a CHANGES.txt entry.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854899#action_12854899
 ] 

Mark Miller commented on LUCENE-2074:
-

{quote}Uwe, must this be coupled with that issue? This one has been waiting for 
a long time (why? for the JFlex 1.5 release?), and protecting against a huge 
buffer allocation can be a really quick and tiny fix. And this one also focuses 
on getting Unicode 5 to work, which is unrelated to the buffer size. But the 
buffer size is not such a critical issue that we need to move fast on it 
... so it's your call. I just thought they are two unrelated problems.{quote}

Agreed. Whether it's fixed as part of this commit or not, it really deserves its 
own issue anyway, for changes and tracking. It has nothing to do with this 
issue other than convenience. 

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2074:
--

Attachment: LUCENE-2074.patch

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2074:
--

Attachment: (was: LUCENE-2074.patch)

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Uwe Schindler (JIRA)
Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
-

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


When indexing large documents, the lexer buffer may stay large forever. This 
sub-issue resets the lexer buffer back to the default on reset(Reader).

This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854900#action_12854900
 ] 

Uwe Schindler commented on LUCENE-2074:
---

Created sub-issue: LUCENE-2384

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves differently for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Ruben Laguna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854901#action_12854901
 ] 

Ruben Laguna commented on LUCENE-2384:
--

The mailing list discussion that originated this is [1]


[1] http://lucene.markmail.org/thread/ndmcgffg2mnwjo47



 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854902#action_12854902
 ] 

Robert Muir commented on LUCENE-2384:
-

If tokenizers like StandardTokenizer just end up reading things into RAM 
anyway, we should remove Reader from the Tokenizer interface.

Supporting a Reader instead of simply tokenizing the entire doc causes our 
tokenizers to be very, very complex (see CharTokenizer).
It would be nice to remove this complexity, if the objective doesn't really 
work anyway.

 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854903#action_12854903
 ] 

Uwe Schindler commented on LUCENE-2384:
---

For JFlex this does not help, as the JFlex-generated code always needs a Reader. 
This is special here: the lexer does not need to load the whole document into 
the buffer; it only sometimes needs a large lookahead/lookback buffer.

 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Ruben Laguna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruben Laguna updated LUCENE-2384:
-

Attachment: reset.diff

Patch to reset the zzBuffer when the input is reset. The code is really taken 
from 
https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422...@web38901.mail.mud.yahoo.com
 so I can't really grant a license to use it, but I think the guy released it 
as public domain by posting it to the mailing list.

 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: reset.diff


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Ruben Laguna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854905#action_12854905
 ] 

Ruben Laguna edited comment on LUCENE-2384 at 4/8/10 11:24 AM:
---

Patch to reset the zzBuffer when the input is reset. The code is really taken 
from 
https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422...@web38901.mail.mud.yahoo.com
 so I can't really grant a license to use it, but I think the guy released it 
as public domain by posting it to the mailing list. 

I tested it and it seems to work for me. Just including it here in case 
somebody wants to apply the patch directly to 3.0.1 (although it's better to 
wait for 3.1).

  was (Author: ecerulm):
Patch to reset the zzBuffer when the input is reset. The code is really 
taken from 
https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422...@web38901.mail.mud.yahoo.com
 so I can't really grant a license to use it, but I think the guy released it 
as public domain by posting it to the mailing list.
  
 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: reset.diff


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854906#action_12854906
 ] 

Robert Muir commented on LUCENE-2384:
-

bq. For JFlex this does not help as the Jflex-generated code always needs a 
Reader.

This can be fixed. Currently all I/O in all tokenizers is broken and buggy, and 
does not correctly handle special cases around their 'buffering'.

The only one that is correct is CharTokenizer, but at what cost? It has so much 
complexity because of this Reader issue.

We should stop pretending like we can really stream docs with Reader.
We should stop pretending like 8GB documents or something exist, where we can't 
just analyze the whole doc at once and make things simple.
And then we can fix the Lucene tokenizers to be correct.


 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: reset.diff


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854908#action_12854908
 ] 

Uwe Schindler commented on LUCENE-2384:
---

{quote}
Patch to reset the zzBuffer when the input is reset. The code is really taken 
from 
https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422...@web38901.mail.mud.yahoo.com
 so I can't really grant a license to use it, but I think the guy released it 
as public domain by posting it to the mailing list. 
I tested it and it seems to work for me. Just including it here in case 
somebody wants to apply the patch directly to 3.0.1 (although it's better to 
wait for 3.1)
{quote}

Your fix adds additional complexity. Just reset the buffer back to the 
default ZZ_BUFFERSIZE, if it has grown, on reset. Your patch always reallocates 
a new buffer.

Use this:
{code}
public final void reset(Reader r) {
  // reset to default buffer size, if buffer has grown
  if (zzBuffer.length > ZZ_BUFFERSIZE) {
    zzBuffer = new char[ZZ_BUFFERSIZE];
  }
  yyreset(r);
}
{code}

 Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
 -

 Key: LUCENE-2384
 URL: https://issues.apache.org/jira/browse/LUCENE-2384
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Affects Versions: 3.0.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: reset.diff


 When indexing large documents, the lexer buffer may stay large forever. This 
 sub-issue resets the lexer buffer back to the default on reset(Reader).
 This is done on the enclosing issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1482) Replace infoStream by a logging framework (SLF4J)

2010-04-08 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854919#action_12854919
 ] 

Jukka Zitting commented on LUCENE-1482:
---

We use SLF4J in Jackrabbit, and having logs from the embedded Lucene index 
available through the same mechanism would be quite useful in some situations.

BTW, using isDebugEnabled() is often not necessary with SLF4J, see 
http://www.slf4j.org/faq.html#logging_performance
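
To illustrate the FAQ's point: parameterized messages defer string construction until the level check passes, so the explicit guard is usually only worth it when computing an argument is itself expensive. A minimal sketch (assuming the standard SLF4J API; the method names below are made up for illustration):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);

  public void foo(int docId) {
    // The {} placeholder is only formatted if DEBUG is enabled; no guard needed.
    logger.debug("processing doc {}", docId);

    // Guarded form, for when building the argument is itself expensive:
    if (logger.isDebugEnabled()) {
      logger.debug("state: " + computeExpensiveSummary());
    }
  }

  private String computeExpensiveSummary() {
    return "...";  // stands in for a hypothetical expensive computation
  }
}
{code}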

 Replace infoStream by a logging framework (SLF4J)
 -

 Key: LUCENE-1482
 URL: https://issues.apache.org/jira/browse/LUCENE-1482
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
 slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar


 Lucene makes use of infoStream to output messages in its indexing code only. 
 For debugging purposes, when the search application is run on the customer 
 side, getting messages from other code flows, like search, query parsing, 
 analysis etc can be extremely useful.
 There are two main problems with infoStream today:
 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
 other classes I need to either expose an API or propagate infoStream to all 
 classes (see for example DocumentsWriter, which receives its infoStream 
 instance from IndexWriter).
 2. I can either turn debugging on or off, for the entire code.
 Introducing a logging framework can allow each class to control its logging 
 independently, and more importantly, allows the application to turn on 
 logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
 I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, 
 as its name states, a facade over different logging frameworks. As such, you 
 can include the slf4j.jar in your application, and it recognizes at deploy 
 time what is the actual logging framework you'd like to use. SLF4J comes with 
 several adapters for Java logging, Log4j and others. If you know your 
 application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
 your classpath, and your logging statements will use Java logging underneath 
 the covers.
 This makes the logging code very simple. For a class A the logger will be 
 instantiated like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);
 }
 And will later be used like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);
   public void foo() {
     if (logger.isDebugEnabled()) {
       logger.debug("message");
     }
   }
 }
 That's all!
 Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
 (but I assume it's fast also over other logging frameworks).
 The important thing is, every class controls its own logger. Not all classes 
 have to output logging messages, and we can improve Lucene's logging 
 gradually, w/o changing the API, by adding more logging messages to 
 interesting classes.
 I will submit a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1482) Replace infoStream by a logging framework (SLF4J)

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854920#action_12854920
 ] 

Shai Erera commented on LUCENE-1482:


I still think that calling isDebugEnabled is better, because the message 
formatting stuff may do unnecessary things like casting, autoboxing etc. IMO, 
if logging is enabled, evaluating it twice is not a big deal ... it's a simple 
check.

I'm glad someone here thinks logging will be useful though :). I wish there 
were a quorum here to proceed with it.

Note that I also offered to not create any dependency on SLF4J, but rather 
extract infoStream to a static InfoStream class, which will avoid passing it 
around everywhere, and give the flexibility to output stuff from other classes 
which don't have an infoStream at hand.

 Replace infoStream by a logging framework (SLF4J)
 -

 Key: LUCENE-1482
 URL: https://issues.apache.org/jira/browse/LUCENE-1482
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
 slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar


 Lucene makes use of infoStream to output messages in its indexing code only. 
 For debugging purposes, when the search application is run on the customer 
 side, getting messages from other code flows, like search, query parsing, 
 analysis etc can be extremely useful.
 There are two main problems with infoStream today:
 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
 other classes I need to either expose an API or propagate infoStream to all 
 classes (see for example DocumentsWriter, which receives its infoStream 
 instance from IndexWriter).
 2. I can either turn debugging on or off, for the entire code.
 Introducing a logging framework can allow each class to control its logging 
 independently, and more importantly, allows the application to turn on 
 logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
 I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, 
 as its name states, a facade over different logging frameworks. As such, you 
 can include the slf4j.jar in your application, and it recognizes at deploy 
 time what is the actual logging framework you'd like to use. SLF4J comes with 
 several adapters for Java logging, Log4j and others. If you know your 
 application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
 your classpath, and your logging statements will use Java logging underneath 
 the covers.
 This makes the logging code very simple. For a class A the logger will be 
 instantiated like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);
 }
 And will later be used like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);
   public void foo() {
     if (logger.isDebugEnabled()) {
       logger.debug("message");
     }
   }
 }
 That's all!
 Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
 (but I assume it's fast also over other logging frameworks).
 The important thing is, every class controls its own logger. Not all classes 
 have to output logging messages, and we can improve Lucene's logging 
 gradually, w/o changing the API, by adding more logging messages to 
 interesting classes.
 I will submit a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Created: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-08 Thread Erick Erickson
What kind of JVM settings are you using? Lots of people index lots of
documents without running into this; can you provide more specifics about
your indexing settings?

On Tue, Apr 6, 2010 at 10:51 PM, Shivender Devarakonda (JIRA) 
j...@apache.org wrote:

 java.lang.OutOfMemoryError:Java heap space
 --

 Key: LUCENE-2376
 URL: https://issues.apache.org/jira/browse/LUCENE-2376
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1
 Environment: Windows
Reporter: Shivender Devarakonda


 I see an OutOfMemory error in our product and it is happening when we have
 some data objects on which we built the index. I see the following
 OutOfMemory error; this is happening after we call IndexWriter.optimize():


 4/06/10 02:03:42.160 PM PDT [ERROR] [Lucene Merge Thread #12]  In thread
 Lucene Merge Thread #12 and the message is
 org.apache.lucene.index.MergePolicy$MergeException:
 java.lang.OutOfMemoryError: Java heap space
 4/06/10 02:03:42.207 PM PDT [VERBOSE] [Lucene Merge Thread #12] [Manager]
 Uncaught Exception in thread Lucene Merge Thread #12
 org.apache.lucene.index.MergePolicy$MergeException:
 java.lang.OutOfMemoryError: Java heap space
at
 org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
at
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
 Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:462)
at java.util.HashMap.addEntry(HashMap.java:755)
at java.util.HashMap.put(HashMap.java:385)
at
 org.apache.lucene.index.FieldInfos.addInternal(FieldInfos.java:256)
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:366)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
at
 org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:638)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:608)
at
 org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:686)
at
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4979)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4614)
at
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
at
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
 4/06/10 02:03:42.895 PM PDT [ERROR]  this writer hit an OutOfMemoryError;
 cannot complete optimize


 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854957#action_12854957
 ] 

Tom Burton-West commented on LUCENE-1709:
-

I am having the same issue Shai reported in LUCENE-2353, with the parallel tests 
apparently causing the tests to hang on my Windows box with both Revision 
931573 and Revision 931304 when running the tests from root.

Tests hang in WriteLineDocTaskTest, on this line:
[junit]  config properties:
[junit] directory = RAMDirectory
[junit] doc.maker = 
org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = 
D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line
[junit] --- 


I just ran the test last night with Revision 931708 and had no problem.  Ran 
it again this morning and got the hanging behavior.  The difference is that 
last night the only thing running on my computer besides a couple of ssh 
terminal windows was the tests.  Today when I ran the tests and got the 
hanging behavior, I had Firefox, Outlook, Exceed, and WordPad open.  The tests 
are taking 98-99.9% of my CPU while hanging.  I suspect there is some kind of 
resource issue when running the tests in parallel.

Tom Burton-West

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say "use N threads" and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854959#action_12854959
 ] 

Robert Muir commented on LUCENE-1709:
-

Thanks Tom and Shai... sorry I haven't gotten to fix this yet.

Shai, would you mind committing your patch? We can keep the issue open to add 
the sysprop and fix the ant jar thing, and apply the same fixes to Solr's 
build.xml.


 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say "use N threads" and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854960#action_12854960
 ] 

Tom Burton-West commented on LUCENE-1709:
-

This may or may not be a clue to the problem in benchmark.  When I control-C'd 
the hung test, I got the error reported below.
Tom.


[junit] directory = RAMDirectory
[junit] doc.maker = 
org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker
[junit] line.file.out = 
C:\cygwin\home\tburtonw\lucene\april07_good\build\contrib\benchmark\test\W\one-line
[junit] ---
[junit] -  ---
[junit] java.io.FileNotFoundException: 
C:\cygwin\home\tburtonw\lucene\april07_good\contrib\benchmark\junitvmwatcher203463231158436475.properties
 (The process cannot access the file because it is being used by another 
process)
[junit] at java.io.FileInputStream.open(Native Method)
[junit] at java.io.FileInputStream.<init>(FileInputStream.java:106)
[junit] at java.io.FileReader.<init>(FileReader.java:55)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.executeAsForked(JUnitTask.java:1025)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:876)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:803)
[junit] at 
org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at 
org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:62)
[junit] at 
org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at 
org.apache.tools.ant.taskdefs.MacroInstance.execute(MacroInstance.java:394)
[junit] at 
org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
[junit] at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
[junit] at org.apache.tools.ant.Task.perform(Task.java:348)
[junit] at 
org.apache.tools.ant.taskdefs.Parallel$TaskRunnable.run(Parallel.java:428)
[junit] at java.lang.Thread.run(Thread.java:619)


 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854967#action_12854967
 ] 

Robert Muir commented on LUCENE-1709:
-

Thanks Tom, this is exactly what happened to Shai.

Can you try his patch and see if it fixed the problem for you?

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855020#action_12855020
 ] 

Shai Erera commented on LUCENE-1709:


Robert, I will commit the patch, seems good to do anyway. We can handle the ant 
jars separately later.

And this hang behavior is exactly what I experience, including the 
FileInputStream thing. Only on my machine, when I took a thread dump, it showed 
that Ant waits on FIS.read() ...

Robert - to remind you that even with the patch which forces junit to use a 
separate temp folder per thread, it still hung ... 

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-04-08 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855022#action_12855022
 ] 

Tom Burton-West commented on LUCENE-1709:
-

Hi Robert,

I patched Revision 931708 and ran ant clean test-contrib and the tests ran 
just fine.  The patch seems to have solved the problem.

Tom

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, 
 LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)
Move NoDeletionPolicy from benchmark to core


 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1


As the subject says, but I'll also make it a singleton + add some unit tests, 
as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)
IndexWriter commits unnecessarily on fresh Directory


 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
unnecessary, and kind of brings back an autoCommit mode, in a strange way ... 
why do we need that commit? Do we really expect people to open an IndexReader 
on an empty Directory which they just passed to an IW w/ create=true? If they 
want, they can simply call commit() right away on the IW they created.

I ran into this when writing a test which committed N times, then compared the 
number of commits (via IndexReader.listCommits) and was surprised to see N+1 
commits.

Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
jumping on me .. so the change might not be that simple. But I think it's 
manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2385:
---

Attachment: LUCENE-2385.patch

Moves NoDeletionPolicy to core, adds javadocs + TestNoDeletionPolicy. Also 
includes the relevant changes to benchmark (algorithms + CreateIndexTask).
I've fixed a typo I had in NoMergeScheduler - not related to this issue, but 
since it was just a typo, I figured there's no harm in doing it here.

Tests pass. Planning to commit shortly.
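
For reference, the singleton no-op policy boils down to something like this (a 
minimal sketch only - the class in the patch may differ in naming and detail):

{code}
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Keeps all commit points alive by simply never deleting anything.
public final class NoDeletionPolicySketch implements IndexDeletionPolicy {
  public static final IndexDeletionPolicy INSTANCE = new NoDeletionPolicySketch();

  private NoDeletionPolicySketch() {} // singleton - use INSTANCE

  public void onInit(List<? extends IndexCommit> commits) {}   // keep every commit
  public void onCommit(List<? extends IndexCommit> commits) {} // never delete
}
{code}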

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855131#action_12855131
 ] 

Shai Erera commented on LUCENE-2386:


Took a look at IndexFileDeleter, and located the offending code segment which is 
responsible for the CorruptIndexException:
{code}
if (currentCommitPoint == null) {
  // We did not in fact see the segments_N file
  // corresponding to the segmentInfos that was passed
  // in.  Yet, it must exist, because our caller holds
  // the write lock.  This can happen when the directory
  // listing was stale (eg when index accessed via NFS
  // client with stale directory listing cache).  So we
  // try now to explicitly open this commit point:
  SegmentInfos sis = new SegmentInfos();
  try {
sis.read(directory, segmentInfos.getCurrentSegmentFileName(), codecs);
  } catch (IOException e) {
    throw new CorruptIndexException("failed to locate current segments_N file");
  }
{code}

Looks like this code protects against a real problem, which was raised on the 
list a couple of times already - stale NFS cache. So I'm reluctant to remove 
that check ... though I still think we should differentiate between a newly 
created index on a fresh Directory and a stale NFS problem. Maybe we can pass a 
boolean isNew or something like that to the ctor, and if it's a new index and 
the last commit point is missing, IFD will not throw the exception, but 
silently ignore that? So the code would become something like this:
{code}
if (currentCommitPoint == null && !isNew) {
   
}
{code}

Does this make sense, or am I missing something?

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-08 Thread Ruben Laguna (JIRA)
IndexWriter retains references to Readers used in Fields (memory leak)
--

 Key: LUCENE-2387
 URL: https://issues.apache.org/jira/browse/LUCENE-2387
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0.1
Reporter: Ruben Laguna


As described in [1], IndexWriter retains references to the Readers used in 
Fields, and that can lead to big memory leaks when using Tika's ParsingReaders 
(as those can take 1MB per ParsingReader). 

[2] shows a screenshot of the reference chain to the Reader from the 
IndexWriter, taken with Eclipse MAT (Memory Analysis Tool). The chain is the 
following:

IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState -> 
DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable -> Field 
(fieldsData) 


-
[1] http://markmail.org/thread/ndmcgffg2mnwjo47
[2] http://skitch.com/ecerulm/n7643/eclipse-memory-analyzer



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855135#action_12855135
 ] 

Michael McCandless commented on LUCENE-2386:


I agree: IW really should not commit the first segments_1, for CREATE when Dir 
has no index already.  App should immediately .commit() if it really wants to.

We should fix IFD to know if it's dealing with a known new index and bypass 
that check that works around stale NFS dir listing (boolean arg sounds good).

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855136#action_12855136
 ] 

Uwe Schindler commented on LUCENE-2385:
---

The patch does not look like you svn moved the files. To preserve history, you 
should do a svn move of the file in your local repository and then modify it 
to reflect the package changes (if any).

Did you do this?

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855140#action_12855140
 ] 

Shai Erera commented on LUCENE-2385:


I did that first, but then remembered that when I did that in the past, people 
were unable to apply my patches, w/o doing the svn move themselves. Anyway, for 
this file it's not really important I think - a very simple and tiny file, w/ 
no history to preserve? Is that ok for this file (b/c I have no idea how to do 
the svn move now ... after I've made all the changes already) :)

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855148#action_12855148
 ] 

Shai Erera commented on LUCENE-2386:


Looking at IFD again, I think a boolean ctor arg is not required. What I can do 
is check if any Lucene file has been seen (in the for-loop iteration on the 
Directory files), and if not, then deduce it's a new Directory, and skip that 
'if' check. I'll give it a shot.
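
Something like this (a rough sketch of the idea only - names are from memory 
and may not match IFD's actual fields):

{code}
// While scanning the Directory on init, remember whether we saw any Lucene file:
boolean sawLuceneFile = false;
for (String fileName : directory.listAll()) {
  if (filter.accept(null, fileName)) { // matches segments_N, .cfs, .frq, etc.
    sawLuceneFile = true;
    // ... existing per-file handling ...
  }
}

// Only complain about a missing commit point if this wasn't a fresh Directory:
if (currentCommitPoint == null && sawLuceneFile) {
  throw new CorruptIndexException("failed to locate current segments_N file");
}
{code}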

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855150#action_12855150
 ] 

Uwe Schindler commented on LUCENE-2385:
---

In general we place a list of all svn move/copy commands together with the 
patch, executable from the root dir. If you paste those commands into your 
terminal and then apply the patch, it works. One example is the jflex issue 
(ok, the commands are shortened).

Another possibility is to have a second checkout: one where you arrange the 
files correctly (svn moved/copied) and one for creating the patches.

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2385:
---

Attachment: LUCENE-2385.patch

Is it better now?

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855155#action_12855155
 ] 

Shai Erera commented on LUCENE-2385:


Forgot to mention that the only move I made was of NoDeletionPolicy:

svn move 
contrib/benchmark/src/java/org/apache/lucene/benchmark/utils/NoDeletionPolicy.java
 src/java/org/apache/lucene/index/NoDeletionPolicy.java

I'll remember that in the future Uwe - thanks for the heads up !

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855164#action_12855164
 ] 

Uwe Schindler commented on LUCENE-2385:
---

Yeah, that's fine!

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2385.


Resolution: Fixed

Committed revision 932129.

 Move NoDeletionPolicy from benchmark to core
 

 Key: LUCENE-2385
 URL: https://issues.apache.org/jira/browse/LUCENE-2385
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark, Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1

 Attachments: LUCENE-2385.patch, LUCENE-2385.patch


 As the subject says, but I'll also make it a singleton + add some unit tests, 
 as well as some documentation. I'll post a patch hopefully today.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: IndexWriter memory leak?

2010-04-08 Thread Uwe Schindler
There is one possibility that could be fixed:

As Tokenizers are reused, the analyzer holds a reference to the last used 
Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If this 
is the case for you, that may be easy to do. Tokenizer.close() would then look 
like this:

  /** By default, closes the input Reader. */
  @Override
  public void close() throws IOException {
input.close();
input = null; // <-- new!
  }

Any comments from other committers?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
 Sent: Thursday, April 08, 2010 2:50 PM
 To: java-u...@lucene.apache.org
 Subject: Re: IndexWriter memory leak?
 
 I will double check in the afternoon the heapdump.hprof. But I think
 that
 *some* readers are indeed held by
 docWriter.threadStates[0].consumer.fieldHash[1].fields[],
 as shown in [1] (this heapdump contains only live objects).  The
 heapdump
 was taken after IndexWriter.commit() /IndexWriter.optimize() and all
 the
 Documents were already indexed and GCed (I will double check).
 
 So that would mean that the Reader is retained in memory by the
 following
 chaing of references,
 
 DocumentsWriter - DocumentsWriterThreadState -
 DocFieldProcessorPerThread
 - DocFieldProcessorPerField - Fieldable - Field (fieldsData)
 
 I'll double check with Eclipse MAT as I said that this chain is
 actually
 made of hard references only (no SoftReferences,WeakReferences, etc). I
 will
 also double check also that there is no live Document that is
 referencing
 the Reader via the Field.
 
 
 [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
 
 On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler u...@thetaphi.de wrote:
 
  Readers are not held. If you indexed the document and gced the
 document
  instance they readers are gone.
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
   -Original Message-
   From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
   Sent: Thursday, April 08, 2010 1:28 PM
   To: java-u...@lucene.apache.org
   Subject: Re: IndexWriter memory leak?
  
   Now that the zzBuffer issue is solved...
  
   what about the references to the Readers held by docWriter. Tika's
   ParsingReaders are quite heavyweight so retaining those in memory
   unnecessarily is also a hidden memory leak. Should I open a bug report
   on that one?
  
   /Rubén
  
   On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera ser...@gmail.com
 wrote:
  
Guess we were replying at the same time :).
   
On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler u...@thetaphi.de
   wrote:
   
 I already answered, that I will take care of this!

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: Shai Erera [mailto:ser...@gmail.com]
  Sent: Thursday, April 08, 2010 12:00 PM
  To: java-u...@lucene.apache.org
  Subject: Re: IndexWriter memory leak?
 
  Yes, that's the trimBuffer version I was thinking about, only
   this guy
  created a reset(Reader, int) and does both ops (resetting +
 trim)
   in
  one
  method call. More convenient. Can you please open an issue to
   track
  that?
  People will have a chance to comment on whether we (Lucene)
   should
  handle
  that, or it should be a JFlex fix. Based on the number of
 replies
   this
  guy
  received (0 !), I doubt JFlex would consider it a problem.
 But we
   can
  do
  some small service to our users base by protecting against
 such
  problems.
 
  And while you're opening the issue, if you want to take a
 stab at
  fixing it
  and post a patch, it'd be great :).
 
  Shai
 
  On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna
   ruben.lag...@gmail.com wrote:
 
   I was investigating this a little further and in the JFlex
   mailing
  list I
   found [1]
  
   I don't know much about flex / JFlex but it seems that this
 guy
  resets the
   zzBuffer to 16384 or less when setting the input for the
 lexer
  
  
   Quoted from  shef she...@ya...
  
  
   I set
  
   %buffer 0
  
   in the options section, and then added this method to the
   lexer:
  
  /**
   * Set the input for the lexer. The size parameter
 really
   speeds
  things
   up,
   * because by default, the lexer allocates an internal
   buffer of
  16k.
   For
   * most strings, this is unnecessarily large. If the
 size
   param is
   0 or greater
   * than 16k, then the buffer is set to 16k. If the size
   param is
   smaller, then
   * the buf will be set to the exact size.
   * 

Re: Changing the subject for a JIRA-issue (Was: [jira] Created: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into f

2010-04-08 Thread Chris Hostetter

:  Is it possible to change it? If not, what is the policy here? To open a
:  new issue and close the old one?
...
: In this case, that would mean either closing this issue and opening a new one,
: or taking the discussion to the mailing list where subject headers may be
: modified as the conversation evolves.  

Anyone who can edit an issue (ie: all the committers, and anyone in the 
developer group) can change the summary (which changes the email 
subjects)

It's not clear to me what the summary of LUCENE-2335 should be, but since 
McCandless opened the issue, he can certainly fix the summary as the issue 
evolves.




-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2386:
---

Attachment: LUCENE-2386.patch

First stab at this. Patch still missing CHANGES entry, and I haven't run all 
the tests, just TestIndexWriter. With those changes it passes. One thing that I 
think should be fixed is testImmediateDiskFull - if I don't add 
writer.commit(), the test fails, because dir.getRecomputedActualSizeInBytes() 
returns 0 (no RAMFiles yet), and then the test succeeds at adding one document. 
So maybe just change the test to set maxSizeInBytes to '1', always?

TestNoDeletionPolicy is not covered by this patch (should be fixed as well, 
because now the number of commits is exactly N and not N+1). Will fix it 
tomorrow.

Anyway, it's really late now, so hopefully some fresh eyes will look at it 
while I'm away, and comment on the proposed changes. I hope I got all the 
changes to the tests right.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2074:
--

Attachment: LUCENE-2074.patch

New patch with replacement of deprecated TermAttribute -> CharTermAttribute. It 
also fixes the reset()/reset(Reader) methods to conform to all other 
Tokenizers and the documentation. The current one was resetting multiple 
times. This has no effect on backwards compatibility. Also improves the JFlex 
classpath detection to work with svn checkouts or future release zips.

I will commit this soon when all tests ran.

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switched to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves different for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Getting fsync out of the loop

2010-04-08 Thread Earwin Burrfoot
 But, IW doesn't let you hold on to checkpoints... only to commits.

 Ie SnapshotDP will only see actual commit/close calls, not
 intermediate checkpoints like a random segment merge completing, a
 flush happening, etc.

 Or... maybe you would in fact call commit frequently from the main
 threads (but with fsync disabled), and then your DP holds onto these
 "fake" commits, periodically picking one of them to do the real
 fsync'ing?
Yeah, that's exactly what I tried to describe in my initial post :)
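
Roughly, the deletion-policy half of it would look something like this (an
untested sketch of the idea, not code from my patch - the background thread
that does the real fsync and releases old commits is left out):

import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Holds on to every "fake" (not-yet-fsynced) commit; a background thread
// periodically reads getLatest(), fsyncs its files for real, and only then
// allows older commits to be released.
public class PeriodicSyncDeletionPolicy implements IndexDeletionPolicy {
  private volatile IndexCommit latest;

  public void onInit(List<? extends IndexCommit> commits) { onCommit(commits); }

  public void onCommit(List<? extends IndexCommit> commits) {
    latest = commits.get(commits.size() - 1); // remember the newest commit
    // deliberately delete nothing here; cleanup happens after the real fsync
  }

  public IndexCommit getLatest() { return latest; }
}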

 I'm just playing around with stupid idea. I'd like to have NRT
 look-alike without binding readers and writers. :)
 I see... well binding durability & visibility will always be costly.
 This is why Lucene decouples them (by making NRT readers available).
 My experiments do the same, essentially.
 But after I understood that to perform deletions IW has to load term indexes
 anyway, I'm almost ready to give up and go for intertwined IW/IR mess :)
 Hey if you really think it's a mess, post a patch that cleans it up :)
Uh oh. Let me finish the current one first. Second - I don't know yet how
this should look.
Something along the lines of deletions/norms writers being extracted
from segment reader
and reader pool being made external to IW??

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into field cache)

2010-04-08 Thread Michael McCandless
Actually Toke opened a new issue (LUCENE-2369) for the new approach to
Locale-based sorting... I think we should leave the existing issue as
the single-segment optimization (it's a separate issue).

Mike

On Thu, Apr 8, 2010 at 6:06 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :  Is it possible to change it? If not, what is the policy here? To open a
 :  new issue and close the old one?
        ...
 : In this case, that would mean either closing this issue and opening a new 
 one,
 : or taking the discussion to the mailing list where subject headers may be
 : modified as the conversation evolves.

 Anyone who can edit an issue (ie: all the committers, and anyone in the
 developer group) can change the summary (which changes the email
 subjects)

 It's not clear to me what the summary of LUCENE-2335 should be, but since
 McCandless opened the issue, he can certainly fix the summary as the issue
 evolves.




 -Hoss



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855215#action_12855215
 ] 

Michael McCandless commented on LUCENE-2386:


I think the patch is good, Shai.  I'd be curious what other tests rely on an 
immediate commit on creating an index...

Maybe change testImmediateDiskFull to set max allowed size to max(1, 
current-usage)?  In case we change IW to write other stuff in the future on 
create...
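
Ie, something like this in the test (a sketch only - assuming MockRAMDirectory's 
accessors, which the test already uses):

{code}
// cap the directory at its current size (at least 1 byte), so the very
// next added document must hit the "disk full" path:
long actual = dir.getRecomputedActualSizeInBytes();
dir.setMaxSizeInBytes(Math.max(1, actual));
{code}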

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Getting fsync out of the loop

2010-04-08 Thread Michael McCandless
On Thu, Apr 8, 2010 at 6:21 PM, Earwin Burrfoot ear...@gmail.com wrote:
 But, IW doesn't let you hold on to checkpoints... only to commits.

 Ie SnapshotDP will only see actual commit/close calls, not
 intermediate checkpoints like a random segment merge completing, a
 flush happening, etc.

 Or... maybe you would in fact call commit frequently from the main
 threads (but with fsync disabled), and then your DP holds onto these
 "fake" commits, periodically picking one of them to do the real
 fsync'ing?
 Yeah, that's exactly what I tried to describe in my initial post :)

Ahh ok then it makes more sense.  But still you shouldn't commit that
often (even with fake fsync) since it must flush the segment.

 I'm just playing around with stupid idea. I'd like to have NRT
 look-alike without binding readers and writers. :)
 I see... well binding durability & visibility will always be costly.
 This is why Lucene decouples them (by making NRT readers available).
 My experiments do the same, essentially.
 But after I understood that to perform deletions IW has to load term indexes
 anyway, I'm almost ready to give up and go for intertwined IW/IR mess :)
 Hey if you really think it's a mess, post a patch that cleans it up :)
 Uh oh. Let me finish the current one first.

Heh, yes :)

 Second - I don't know yet how
 this should look.
 Something along the lines of deletions/norms writers being extracted
 from segment reader
 and reader pool being made external to IW??

Yeah, reader pool should be pulled out of IW, and I think IW should be
split into that which manages the segment infos, that which
adds/deletes docs, and the rest (merging, addIndexes*)?  (There's
an issue open for this refactoring...).

I'm not sure about deletions/norms writers being extracted from SR...
I think delete ops would still go through IW?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Incremental Field Updates

2010-04-08 Thread Babak Farhang
Good point. I meant the model at the document level: i.e. what
milestones does a document go through in its life cycle. Today:

created --> deleted

With incremental updates:

created --> update1 --> update2 --> deleted

I think what I'm trying to say is that this second threaded sequence
of state changes seems intuitively more fragile under concurrent
scenarios.  So for example, in a lock-free design, the system would
also have to anticipate the following sequence of events:

created --> update1 --> deleted --> update2

and consider update2 a null op.  I'm imagining there are other cases
that I can't think of..

-Babak


On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 write once, plus the option to the app to keep multiple commit points
 around (by customizing the deletion policy).

 Actually order of operations / commits very much matters in Lucene today.

 Deletions are not idempotent: if you add a doc w/ term X, delete by
 term X, add a new doc with term X... that's very different than if you
 moved the delete op to the end.  Ie the deletion only applies to the
 docs added before it.

 Mike

 On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang farh...@gmail.com wrote:
 Sure. Because of the write once principle.  But at some cost
 (duplicated data). I was just agreeing that it would not be a good
 idea to bake in version-ing by keeping the layers around forever in a
 merged index; I wasn't keying in on transactions per se.

 Speaking of transactions: I'm not sure if we should worry about this
 much yet, but with updates the order of the transaction commits
 seems important. I think commit order is less important today in
 Lucene because its model supports only 2 types of events: document
 creation--which only happens once, and document deletion, which is
 idempotent.  What do you think? Will commits have to be ordered if we
 introduce updates?  Or does the onus of maintaining order fall on the
 application?

 -Babak

 On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang farh...@gmail.com wrote:
 I think they get merged in by the merger, ideally in the background.

 That sounds sensible. (In other words, we wont concern ourselves with
 roll backs--something possible while a layer is still around.)

 Actually roll backs would still be very possible even if layers are merged.

 Ie, one could keep multiple commits around, and the older commits
 would still be referring to the old postings + layers, keeping them
 alive.

 Lucene would still be transactional with such an approach.

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855265#action_12855265
 ] 

Shai Erera commented on LUCENE-2386:


bq. Maybe change testImmediateDiskFull to set max allowed size to max(1, 
current-usage)?

Good idea ! Did it and it works.

Now ... one thing I haven't mentioned is the bw break. This is a behavioral bw 
break, which specifically I'm not so sure we should care about, because I 
wonder how many apps out there rely on being able to open a reader before they 
ever committed on a fresh new index. So what do you think - do this change 
anyway, OR ... utilize Version to our aid? I.e., if the Version that was passed 
to IWC is before LUCENE_31, we keep the initial commit, otherwise we don't do 
it? The pro is that I won't need to change many of the tests because they still 
use the LUCENE_30 version (but that is not a strong argument), so it's a weak 
pro. The con is that IW will keep having that doCommit handling in its ctor, 
only now w/ added comments on why this is being kept around etc.

What do you think?
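
If we go the Version route, the gate itself is tiny - something like this 
sketch (illustrative only, not code from the patch):

{code}
import org.apache.lucene.util.Version;

final class InitialCommitGate {
  /** true = keep the old "empty commit on create" behavior for pre-3.1 apps. */
  static boolean keepInitialCommit(Version matchVersion) {
    return !matchVersion.onOrAfter(Version.LUCENE_31);
  }
}
{code}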

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



TestCodecs running time

2010-04-08 Thread Shai Erera
Hi

I've noticed that TestCodecs takes an insanely long time to run on my
machine - between 35-40 seconds. Is that expected?
The reason it runs so long seems to be that each of its threads makes
4000 iterations ... is that really required to ensure correctness?

Shai


Controlling the maximum size of a segment during indexing

2010-04-08 Thread Lance Norskog
Here is a Java unit test that uses the LogByteSizeMergePolicy to
control the maximum size of segment files during indexing. That is, it
tries. It does not succeed. Will someone who truly understands the
merge policy code please examine it. There is probably one tiny
parameter missing.

It adds 20 documents that each are 100k in size.

It creates an index in a RAMDirectory which should have one segment
that's a tad over 1mb, and then a set of segments that are a tad over
500k. Instead, the data does not flush until it commits, writing one
5m segment.


-
org.apache.lucene.index.TestIndexWriterMergeMB
---

package org.apache.lucene.index;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.LuceneTestCase;

/*
 * Verify that segment sizes are limited to # of bytes.
 *
 * Sizing:
 *  Max MB is 0.5m. Verify against this plus 100k slop. (1.2x)
 *  Min MB is 10k.
 *  Each document is 100k.
 *  mergeSegments=2
 *  MaxRAMBuffer=1m. Verify against this plus 200k slop. (1.2x)
 *
 *  This test should cause the ram buffer to flush after 10 documents,
 *  and create a CFS a little over 1meg.
 *  The later documents should be flushed to disk every 5-6 documents,
 *  and create CFS files a little over 0.5meg.
 */


public class TestIndexWriterMergeMB extends LuceneTestCase {
  private static final int MERGE_FACTOR = 2;
  private static final double RAMBUFFER_MB = 1.0;
  static final double MIN_MB = 0.01d;
  static final double MAX_MB = 0.5d;
  static final double SLOP_FACTOR = 1.2d;
  static final double MB = 1000*1000;
  static String VALUE_100k = null;

  // Test controlling the mergePolicy for max # of docs
  public void testMaxMergeMB() throws IOException {
Directory dir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(
TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT));

LogByteSizeMergePolicy mergeMB = new LogByteSizeMergePolicy();
config.setMergePolicy(mergeMB);
mergeMB.setMinMergeMB(MIN_MB);
mergeMB.setMaxMergeMB(MAX_MB);
mergeMB.setUseCompoundFile(true);
mergeMB.setMergeFactor(MERGE_FACTOR);
    config.setMaxBufferedDocs(100); // irrelevant, but the next line fails without this.
    config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
MergeScheduler scheduler = new SerialMergeScheduler();
config.setMergeScheduler(scheduler);
IndexWriter writer = new IndexWriter(dir, config);

    System.out.println("Start indexing");
    for (int i = 0; i < 50; i++) {
      addDoc(writer, i);
      printSegmentSizes(dir);
    }
    checkSegmentSizes(dir);
    System.out.println("Commit");
    writer.commit();
    printSegmentSizes(dir);
    checkSegmentSizes(dir);
    writer.close();
  }

  // document that takes ~100k of RAM
  private void addDoc(IndexWriter writer, int i) throws IOException {
    if (VALUE_100k == null) {
      StringBuilder value = new StringBuilder(100000);
      for (int fill = 0; fill < 100000; fill++) { // 100k chars, per the sizing notes above
        value.append('a');
      }
      VALUE_100k = value.toString();
    }
    Document doc = new Document();
    doc.add(new Field("id", i + "", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", VALUE_100k, Field.Store.YES,
        Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
  }


  private void checkSegmentSizes(Directory dir) {
    try {
      String[] files = dir.listAll();
      for (String file : files) {
        if (file.equals("_0.cfs")) {
          long length = dir.fileLength(file);
          assertTrue("First segment: " + file + " size = " + length + " < "
              + (int) ((SLOP_FACTOR * RAMBUFFER_MB) * MB),
              length < (SLOP_FACTOR * RAMBUFFER_MB) * MB);
        } else if (file.endsWith(".cfs")) {
          long length = dir.fileLength(file);
          // truncated in the archived message; reconstructed to mirror the
          // first branch, checking later segments against MAX_MB
          assertTrue("Later segment: " + file + " size = " + length + " < "
              + (int) ((SLOP_FACTOR * MAX_MB) * MB),
              length < (SLOP_FACTOR * MAX_MB) * MB);
        }
      }
    } catch (IOException e) {
      fail(e.toString());
    }
  }

  // minimal stand-in for printSegmentSizes, which the archive also cuts off
  private void printSegmentSizes(Directory dir) throws IOException {
    for (String file : dir.listAll()) {
      System.out.println("  " + file + ": " + dir.fileLength(file) + " bytes");
    }
  }
}

Re: Controlling the maximum size of a segment during indexing

2010-04-08 Thread Shai Erera
I'm not sure ... but did you set RAMBufferSizeMB on the IWC? It doesn't look
like it, and the default is 16 MB, which would explain why it doesn't flush
before that.
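
For example, something like this (a sketch against the test above, reusing
its RAMBUFFER_MB constant; untested):

    // Cap the RAM buffer instead of disabling auto-flush, so flushes
    // happen during indexing rather than only at commit:
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // flush by RAM, not doc count
    config.setRAMBufferSizeMB(RAMBUFFER_MB); // ~1 MB, i.e. roughly 10 of the 100k docs per flush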

Shai

On Fri, Apr 9, 2010 at 8:01 AM, Lance Norskog goks...@gmail.com wrote:

 Here is a Java unit test that uses the LogByteSizeMergePolicy to
 control the maximum size of segment files during indexing. That is, it
 tries. It does not succeed. Will someone who truly understands the
 merge policy code please examine it? There is probably one tiny
 parameter missing.

 It adds 20 documents that are each 100k in size.

 It creates an index in a RAMDirectory which should have one segment
 that's a tad over 1 MB, and then a set of segments that are a tad over
 500 KB. Instead, the data does not flush until it commits, writing one
 5 MB segment.


[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855277#action_12855277
 ] 

Shai Erera commented on LUCENE-2386:


Apparently, there are more tests that fail ... I lost count, but they're easy to 
fix. I tried writing the following test:

{code}
  public void testNoCommits() throws Exception {
    // Tests that if we don't call commit(), the directory has 0 commits.
    // This has changed since LUCENE-2386, where before IW would always
    // commit on a fresh new index.
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT)));
    assertEquals("expected 0 commits!", 0, IndexReader.listCommits(dir).size());
    // No changes; closing should still generate a commit, because it's a new index.
    writer.close();
    assertEquals("expected 1 commit!", 1, IndexReader.listCommits(dir).size());
  }
{code}

Simple test - it validates that no commits are present right after a fresh index 
is created, w/o closing or committing. However, IndexReader.listCommits fails w/ 
the following exception:

{code}
java.io.FileNotFoundException: no segments* file found in 
org.apache.lucene.store.ramdirect...@2d262d26: files: []
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:652)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:535)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:323)
at 
org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1033)
at 
org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1023)
at 
org.apache.lucene.index.IndexReader.listCommits(IndexReader.java:1341)
at 
org.apache.lucene.index.TestIndexWriter.testNoCommits(TestIndexWriter.java:4966)
   
{code}

The failure occurs when SegmentInfos attempts to find segments.gen and fails. 
So I wonder whether I should fix DirectoryReader to catch that exception and 
simply return an empty Collection, or fix SegmentInfos at this point -- notice 
the "files: []" at the end. I think that by adding a check to the following 
code (SegmentInfos, line 652), which validates that there were any files at all 
before throwing the exception, it'll still work properly and safely (i.e., 
still detect a problematic Directory). That would probably require breaking out 
of the while loop, and I guess fixing some other things in upper layers ... 
therefore I'm not sure I shouldn't simply catch that exception in 
DirectoryReader.listCommits, w/ proper documentation, and be done w/ it. After 
all, it's not supposed to be called on an empty index ... ever? or hardly ever?

{code}
  if (gen == -1) {
    // Neither approach found a generation
    throw new FileNotFoundException("no segments* file found in " +
        directory + ": files: " + Arrays.toString(files));
  }
{code}
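
If I went the DirectoryReader route instead, a minimal sketch of catching it 
could look like this (doListCommits is just a stand-in for the existing 
listCommits body; this is not a patch):

{code}
public static Collection<IndexCommit> listCommits(Directory dir) throws IOException {
  try {
    return doListCommits(dir); // stand-in for the current listCommits implementation
  } catch (FileNotFoundException e) {
    // a fresh/empty Directory has no segments file, hence no commits
    return Collections.<IndexCommit>emptyList();
  }
}
{code}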

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (an empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in the IW ctor, but it got IndexFileDeleter 
 jumping on me ... so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically!) and report back :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-08 Thread Shivender Devarakonda (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivender Devarakonda updated LUCENE-2376:
--

Attachment: CheckIndex_JavaHeapOOM.txt

CheckIndex output for the Java heap OOM. As I mentioned earlier, we saw the OOM 
while it was indexing the data. I ran CheckIndex on the partially generated 
index folder.




 java.lang.OutOfMemoryError:Java heap space
 --

 Key: LUCENE-2376
 URL: https://issues.apache.org/jira/browse/LUCENE-2376
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1
 Environment: Windows
Reporter: Shivender Devarakonda
 Attachments: CheckIndex_JavaHeapOOM.txt, InfoStreamOutput.txt


 I see an OutOfMemoryError in our product, and it happens when we have 
 some data objects on which we built the index. I see the following 
 OutOfMemoryError after we call IndexWriter.optimize():
 4/06/10 02:03:42.160 PM PDT [ERROR] [Lucene Merge Thread #12]  In thread 
 "Lucene Merge Thread #12" and the message is 
 org.apache.lucene.index.MergePolicy$MergeException: 
 java.lang.OutOfMemoryError: Java heap space
 4/06/10 02:03:42.207 PM PDT [VERBOSE] [Lucene Merge Thread #12] [Manager] 
 Uncaught Exception in thread Lucene Merge Thread #12
 org.apache.lucene.index.MergePolicy$MergeException: 
 java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at java.util.HashMap.resize(HashMap.java:462)
   at java.util.HashMap.addEntry(HashMap.java:755)
   at java.util.HashMap.put(HashMap.java:385)
   at org.apache.lucene.index.FieldInfos.addInternal(FieldInfos.java:256)
   at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:366)
   at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:71)
   at 
 org.apache.lucene.index.SegmentReader$CoreReaders.&lt;init&gt;(SegmentReader.java:116)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:638)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:608)
   at 
 org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:686)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4979)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4614)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
 4/06/10 02:03:42.895 PM PDT [ERROR]  this writer hit an OutOfMemoryError; 
 cannot complete optimize

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2376) java.lang.OutOfMemoryError:Java heap space

2010-04-08 Thread Shivender Devarakonda (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivender Devarakonda updated LUCENE-2376:
--

Attachment: CheckIndex_PermGenSpaceOOM.txt

If we start our product with already-generated index content, then we see a 
PermGen space OOM. I ran CheckIndex on this index folder.

Please let me know your thoughts on these output files.

 java.lang.OutOfMemoryError:Java heap space
 --

 Key: LUCENE-2376
 URL: https://issues.apache.org/jira/browse/LUCENE-2376
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1
 Environment: Windows
Reporter: Shivender Devarakonda
 Attachments: CheckIndex_JavaHeapOOM.txt, 
 CheckIndex_PermGenSpaceOOM.txt, InfoStreamOutput.txt


 I see an OutOfMemoryError in our product, and it happens when we have 
 some data objects on which we built the index. I see the following 
 OutOfMemoryError after we call IndexWriter.optimize():
 4/06/10 02:03:42.160 PM PDT [ERROR] [Lucene Merge Thread #12]  In thread 
 "Lucene Merge Thread #12" and the message is 
 org.apache.lucene.index.MergePolicy$MergeException: 
 java.lang.OutOfMemoryError: Java heap space
 4/06/10 02:03:42.207 PM PDT [VERBOSE] [Lucene Merge Thread #12] [Manager] 
 Uncaught Exception in thread Lucene Merge Thread #12
 org.apache.lucene.index.MergePolicy$MergeException: 
 java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at java.util.HashMap.resize(HashMap.java:462)
   at java.util.HashMap.addEntry(HashMap.java:755)
   at java.util.HashMap.put(HashMap.java:385)
   at org.apache.lucene.index.FieldInfos.addInternal(FieldInfos.java:256)
   at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:366)
   at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:71)
   at 
 org.apache.lucene.index.SegmentReader$CoreReaders.&lt;init&gt;(SegmentReader.java:116)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:638)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:608)
   at 
 org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:686)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4979)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4614)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
 4/06/10 02:03:42.895 PM PDT [ERROR]  this writer hit an OutOfMemoryError; 
 cannot complete optimize

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org