[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-29 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693524#action_12693524
 ] 

Shai Erera commented on LUCENE-1575:


BooleanScorer defines an internal package-private static final Collector class. 
Two questions:
# May I change it to BooleanCollector? (the name conflicts with the Collector 
name we want to give to all base collectors)
# May I change it to private static final? It is used only in BooleanScorer's 
newCollector() method.
I think both changes are safe because the class is already package-private and 
no other Lucene code uses it.

BTW, we might want to review the visibility of BooleanScorer's internal classes. 
They are all package-private, with some public methods, yet are used by 
BooleanScorer only ... But that's something for a different issue.

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as they are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve memory allocation, by 
 allocating a ScoreDoc[] of the requested size only (see the sketch after this 
 list).
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector, and implement topDocs(start, howMany).
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
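 A rough sketch of the paging logic behind topDocs(start, howMany) (assumed 
 names; the real method would wrap the ScoreDoc[] in a TopDocs):
 {code}
 protected ScoreDoc[] page(PriorityQueue pq, int start, int howMany) {
   int size = Math.min(pq.size() - start, howMany); // allocate only the requested window
   if (size <= 0) return new ScoreDoc[0];
   ScoreDoc[] results = new ScoreDoc[size];
   for (int rank = pq.size() - 1; rank >= 0; rank--) { // pq pops the worst hit first
     ScoreDoc sd = (ScoreDoc) pq.pop();
     if (rank >= start && rank < start + size) {
       results[rank - start] = sd; // keep only ranks [start, start + size)
     }
   }
   return results;
 }
 {code}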
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect() (sketched below):
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
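 A minimal sketch of the proposed shape (method names as proposed above; the 
 setNextReader signature is an assumption):
 {code}
 public abstract class Collector {
   public abstract void setNextReader(IndexReader reader, int docBase) throws IOException;
   public abstract void setScorer(Scorer scorer) throws IOException;
   public abstract void collect(int doc) throws IOException; // doc is unbased; no score passed
 }

 // an implementation that needs scores pulls them on demand:
 class CountingCollector extends Collector {
   private Scorer scorer;
   int count;
   public void setNextReader(IndexReader reader, int docBase) {}
   public void setScorer(Scorer scorer) { this.scorer = scorer; }
   public void collect(int doc) throws IOException {
     if (scorer.score() > 0.0f) count++; // score fetched only because this collector needs it
   }
 }
 {code}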
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assess whether a Scorer can be passed. Also, 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't that mean the score 
 is always needed in collect()?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into two: (1) a code patch, and (2) test cases (moving 
 to use Collector instead of HitCollector, as well as testing the new 
 topDocs(start, howMany) method).
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Possible IndexInput optimization

2009-03-29 Thread Paul Elschot
Earwin,

I did not experiment lately, but I'd like to add a general compressed
integer array to the basic types in an index, one that would be compressed
on writing and decompressed on reading.

A first attempt is at LUCENE-1410, and one of the choices I had there
was whether or not to use NIO buffer methods on the index side.
I started there using these NIO buffer methods, but it seems that
the explicit byte arrays you're using here could be a good alternative.

I think my question boils down to whether or not these NIO buffers will
(in the end) get in the way of similar low level optimizations
you'd like to see applied here.

Regards,
Paul Elschot



On Sunday 29 March 2009 00:43:28 Earwin Burrfoot wrote:
 While drooling over MappedBigByteBuffer, which we'll (hopefully) see
 in JDK7, I revisited my own Directory code and noticed a certain
 peculiarity, shared by Lucene core classes:
 Each and every IndexInput implementation only implements readByte()
 and readBytes(), never trying to override readInt/VInt/Long/etc
 methods.
 
 Currently RAMDirectory uses a list of byte arrays as a backing store,
 and I got some speedup when I switched to a custom version that knows each
 file's size beforehand and is thus able to allocate a single byte array
 (deliberately accepting a 2Gb file size limitation) of exactly the needed
 length. Nothing strange here: the readByte(s) methods are easily the most
 often called ones in a Lucene app and they were greatly simplified -
 readByte became merely:
 public byte readByte() throws IOException {
   // bounds checking dropped; we rely on the natural ArrayIndexOutOfBoundsException,
   // since we can't easily catch and recover from it anyway
   return buffer[position++];
 }
 
 But now, readInt is four readByte calls, readLong is two readInts (ten
 calls in total), readString - god knows how many. Unless you use a
 single type of Directory throughout the lifetime of your application,
 these readByte calls are never inlined, and the JIT invokevirtual
 short-circuit optimization (it skips method lookup if it always finds
 the same one during this exact invocation) cannot be applied either.
 
 There are three cases when we can override readNNN methods and provide
 implementations with zero or minimum method invocations -
 RAMDirectory, MMapDirectory and BufferedIndexInput for
 FSDirectory/CompoundFileReader. Anybody tried this?
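 
 A sketch of such an override for the single-array input above (same
 buffer/position fields; big-endian layout matching IndexInput.readInt):
 
 public int readInt() throws IOException {
   int p = position;
   position = p + 4;
   return ((buffer[p] & 0xFF) << 24) | ((buffer[p + 1] & 0xFF) << 16)
        | ((buffer[p + 2] & 0xFF) << 8) | (buffer[p + 3] & 0xFF);
 }
 
 One virtual call instead of four.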
 
 
 -- 
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 


[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread Digy (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693528#action_12693528
 ] 

Digy commented on LUCENE-1581:
--

I believe that Character.toLowerCase in Java works OK, but the problem is:

I -> i (in US)
I -> ı (in TR).

So, I think, I should be able to choose the conversion.

DIGY.

 LowerCaseFilter should be able to be configured to use a specific locale.
 -

 Key: LUCENE-1581
 URL: https://issues.apache.org/jira/browse/LUCENE-1581
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Digy

 // Since I am a .NET programmer, sample code will be in C#, but I don't think 
 it will be a problem to understand. //
 Assume an input text like "İ" and an analyzer like the one below:
 {code}
 public class SomeAnalyzer : Analyzer
 {
     public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
     {
         TokenStream t = new SomeTokenizer(reader);
         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
         t = new LowerCaseFilter(t);
         return t;
     }
 }
 {code}
   
 ASCIIFoldingFilter will return "I", and afterwards LowerCaseFilter will return 
 "i" (if the locale is en-US) or "ı" (if the locale is tr-TR) (which means this 
 token should be input to another instance of ASCIIFoldingFilter).
 So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but 
 a better approach can be adding a new constructor to LowerCaseFilter, forcing 
 it to use a specific locale.
 {code}
 public sealed class LowerCaseFilter : TokenFilter
 {
     /* +++ */ System.Globalization.CultureInfo CultureInfo =
                   System.Globalization.CultureInfo.CurrentCulture;

     public LowerCaseFilter(TokenStream input) : base(input)
     {
     }

     /* +++ */ public LowerCaseFilter(TokenStream input,
                   System.Globalization.CultureInfo CultureInfo) : base(input)
     /* +++ */ {
     /* +++ */     this.CultureInfo = CultureInfo;
     /* +++ */ }

     public override Token Next(Token result)
     {
         result = Input.Next(result);
         if (result != null)
         {
             char[] buffer = result.TermBuffer();
             int length = result.TermLength();
             for (int i = 0; i < length; i++)
                 /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
             return result;
         }
         else
             return null;
     }
 }
 {code}
 DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-29 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693533#action_12693533
 ] 

Michael McCandless commented on LUCENE-1575:


bq. May I change it to BooleanCollector? (the name conflicts with the Collector 
name we want to give to all base collectors)

bq. May I change it to private static final? It is used only in BooleanScorer's 
newCollector() method.

I think these are fine.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Possible IndexInput optimization

2009-03-29 Thread Michael Busch

On 3/29/09 12:43 AM, Earwin Burrfoot wrote:

There are three cases when we can override readNNN methods and provide
implementations with zero or minimum method invocations -
RAMDirectory, MMapDirectory and BufferedIndexInput for
FSDirectory/CompoundFileReader. Anybody tried this?

   


A while ago I tried overriding the read* methods in BufferedIndexInput 
like this:


public int readVInt() throws IOException {
  if (5 <= (bufferLength - bufferPosition)) {
    return readVIntFast();
  }
  return super.readVInt();
}

private int readVIntFast() throws IOException {
  byte b = buffer[bufferPosition++];
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = buffer[bufferPosition++];
    i |= (b & 0x7F) << shift;
  }
  return i;
}


Notice that I don't rely on ArrayIndexOutOfBoundsException; instead I do 
one range check in readVInt() and then call the readVIntFast() method, 
which accesses the buffer array directly to avoid multiple range checks.


Surprisingly, I did not see any performance improvement. In my test I 
wrote a huge file (several GBs) to disk with VInts, making sure they 
occupied more than just a single byte each. Reading the file with and 
without this optimization in BufferedIndexInput made almost no 
difference. Only when I ran it in a profiler did I see a big difference, 
because with this change there are fewer method calls, hence less 
invocation-count overhead.


I'm still surprised there was no performance improvement at all. Maybe 
something was wrong with my test and I should try it again...


-Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693540#action_12693540
 ] 

Shai Erera commented on LUCENE-1581:


From the javadocs 
(http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#toLowerCase(char)):

_In general, String.toLowerCase() should be used to map characters to 
lowercase. String case mapping methods have several benefits over Character 
case mapping methods. String case mapping methods can perform locale-sensitive 
mappings, context-sensitive mappings, and 1:M character mappings, whereas the 
Character case mapping methods cannot._
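
For example, the difference shows up immediately with the Turkish locale:

{code}
Locale tr = new Locale("tr");
System.out.println("TITLE".toLowerCase(tr));    // prints "tıtle" - I maps to dotless ı
System.out.println(Character.toLowerCase('I')); // always 'i' - the char-level mapping is locale-blind
{code}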

So I agree this is a problem, but I see no easy (and efficient) way to fix it. 
Suppose that we allow LowerCaseFilter to accept a Locale. What would it do with 
it? We could add in LowerCaseFilter a Map<Locale, char[65536]> and allow one to 
pass in the Locale. If it's not null, and there's an entry in the map, look up 
every character the filter receives. The lookup will be quite fast, as the 
character will serve as the index into the array (notice that we cover only 
2-byte characters though), and if the entry is \u0000 we can assume there's no 
special handling and call Character.toLowerCase.
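
A minimal sketch of that idea (all names hypothetical):

{code}
// hypothetical per-locale lookup table; a \u0000 entry means "no special handling"
private static final Map<Locale, char[]> LOWER_MAPS = new HashMap<Locale, char[]>();
static {
  char[] tr = new char[65536];
  tr['I'] = '\u0131';  // Turkish: I lowercases to dotless ı
  tr['\u0130'] = 'i';  // Turkish: İ lowercases to i
  LOWER_MAPS.put(new Locale("tr"), tr);
}

static char toLower(char c, Locale locale) {
  char[] map = (locale == null) ? null : LOWER_MAPS.get(locale);
  char mapped = (map == null) ? '\u0000' : map[c];
  return (mapped != '\u0000') ? mapped : Character.toLowerCase(c);
}
{code}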

That is very fragile though, as it's not easy to cover all the special-case 
characters. Also, every time (including this one) we find a special character 
that was not handled properly by the filter, fixing it would break back-compat, 
no?

BTW, when characters are uppercase, I don't think we have a problem, as they 
will always be lowercased to a single character (even if it's the wrong one, it 
will be consistent at indexing and search time). The problem comes with the 
lowercase characters. The character \u0131 (lowercase I in Turkish) is 
lowercased to \u0131, while its uppercase version (I) is lowercased to 'i'. 
Therefore there is a mismatch, and we'll fail if the user enters a lowercase 
query (as they often do).


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Possible IndexInput optimization

2009-03-29 Thread Earwin Burrfoot
 A while ago I tried overriding the read* methods in BufferedIndexInput like
 this:
 
 I'm still surprised there was no performance improvement at all. Maybe
 something was wrong with my test and I should try it again...

For BufferedIndexInput the improvement should be noticeable only when the
file you're reading is loaded completely into the OS disk cache (which was
not your case, I guess). Even then, you're making a syscall for each
1Kb chunk, which could probably dominate 1K method calls.
But for RAMDirectory/MMapDirectory you're not reading disk (if the disk
cache kicked in), and you're not making syscalls. I guess I should
stop asking around and just try it.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Possible IndexInput optimization

2009-03-29 Thread Earwin Burrfoot
 Earwin,
 I did not experiment lately, but I'd like to add a general compressed
 integer array to the basic types in an index, that would be compressed
 on writing and decompressed on reading.
 A first attempt is at LUCENE-1410, and one of the choices I had there
 was whether or not to use NIO buffer methods on the index side.
 I started there using these NIO buffer methods, but it seems that
 the explicit byte arrays you're using here could be a good alternative.
 I think my question boils down to whether or not these NIO buffers will
 (in the end) get in the way of similar low level optimizations
 you'd like to see applied here.
 Regards,

 Paul Elschot
In my case I have to switch to MMap/Buffers; Java behaves ugly with
8Gb heaps. I'm thinking of trying to use Short/Int/LongBuffers that
wrap my initial ByteBuffer.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693545#action_12693545
 ] 

DM Smith commented on LUCENE-1581:
--

This is a bit larger of a problem: it also pertains to upper casing.

I don't remember exactly, but I seem to remember that Java is behind with 
regard to the Unicode spec and Locale support (e.g., it does not include fa, 
Farsi). I find that ICU4J keeps current with the spec.

I don't remember which way it goes, maybe it's both, but some Locales don't 
have a corresponding upper or lower case for some characters.

I'm not sure, but I think efficiency pertains to how it is normalized in 
Unicode (e.g. NFC, NFKC, NFD, or NFKD). These might produce different 
performance results.

(It is a different issue, but it is critical that search requests perform 
the same Unicode normalization as the index. As Lucene does not have these 
normalization filters, I find I have to do this externally in my own filters 
using ICU.)

(Again a different issue: Another related kind of folding is that of base 10 
number shaping.)

Regarding: 
bq. I see no easy way (and efficient) to fix it. Suppose that we allow 
LowerCaseFilter to accept Locale. What would it do with it?

I think that we need Upper and Lower case filters that operate on the token as 
a whole, using the string-level methods to do case conversion.

What I'd like to see is that Lucene has a pluggable way to handle ICU, insofar 
as it does Locale-specific things such as this: e.g., a base class 
UpperCaseFolder that provides the Java implementation, but that can take an 
alternate implementation, perhaps by reflection.
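
A rough sketch of what I mean (all names hypothetical):

{code}
// base class providing the Java implementation; an ICU-backed subclass is
// picked up by reflection when it is on the classpath
public class UpperCaseFolder {
  public static UpperCaseFolder getInstance() {
    try {
      return (UpperCaseFolder) Class.forName("org.example.ICUUpperCaseFolder").newInstance();
    } catch (Exception e) {
      return new UpperCaseFolder(); // ICU absent: degrade predictably to the Java implementation
    }
  }
  public String fold(String token, Locale locale) {
    return token.toUpperCase(locale); // string-level: locale-sensitive, handles 1:M mappings
  }
}
{code}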





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread Digy (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693553#action_12693553
 ] 

Digy commented on LUCENE-1581:
--

Although it is not directly related to this issue, it is good to remember some 
existing problems in Lucene:
https://issues.apache.org/jira/browse/LUCENENET-51

DIGY



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1516) Integrate IndexReader with IndexWriter

2009-03-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1516:
---

Attachment: ssd2.png


OK, using the last patch, I ran another near-real-time test, using this
alg:

{code}

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

merge.policy=org.apache.lucene.index.LogDocMergePolicy

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=10
max.field.length=2147483647

directory=FSDirectory
autocommit=false
compound=false
merge.factor = 10
ram.flush.mb = 128
doc.maker.forever = false
doc.random.id.limit = 3204040

work.dir=/lucene/work

{ "BuildIndex"
  - OpenIndex
  - NearRealtimeReader(1)
  { "UpdateDocs" UpdateDoc > : 10 : 50/sec
  - CloseIndex
}

RepSumByPrefRound BuildIndex
{code}

It opens a full (3.2M docs, previously built) wikipedia index, then
randomly selects a doc and updates it (deletes old, adds new) at a
rate of 50 docs/sec.  Then, once per second, I open a new reader and do
the same search (term "1", sorted by date).

I attached another graph (ssd2.png) with the results, showing reopen and
search time as a function of how many updates have been done; rough
comments:

  * Search time is pretty constant ~35 msec, except occasional
glitches where it goes as high as ~340 msec.  Net/net very
reasonable I think.

  * Search time is remarkably non-noisy, except for occasional
spikes.

  * Reopen time is also fast (~ 40 msec) but is more noisy.

  * It's not clear the merges are really impacting things that much.
It could simply be that I didn't run the test for long enough for a
big merge to run.  Also, this index has no stored fields nor term
vectors, so if we added those, merges would get slower.

  * This is a better test than the last one, since it's doing some deletes.

  * Since I open the writer with autoCommit false, and near-realtime
carries all pending deletes in RAM, no *.del file ever gets
written to the index.


 Integrate IndexReader with IndexWriter 
 ---

 Key: LUCENE-1516
 URL: https://issues.apache.org/jira/browse/LUCENE-1516
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png

   Original Estimate: 672h
  Remaining Estimate: 672h

 The current problem is an IndexReader and IndexWriter cannot be open
 at the same time and perform updates as they both require a write
 lock to the index. While methods such as IW.deleteDocuments enable
 deleting from IW, methods such as IR.deleteDocument(int doc) and
 norms updating are not available from IW. This limits the
 capabilities of performing updates to the index dynamically or in
 realtime without closing the IW and opening an IR, deleting or
 updating norms, flushing, then opening the IW again, a process which
 can be detrimental to realtime updates. 
 This patch will expose an IndexWriter.getReader method that returns
 the currently flushed state of the index as a class that implements
 IndexReader. The new IR implementation will differ from existing IR
 implementations such as MultiSegmentReader in that flushing will
 synchronize updates with IW in part by sharing the write lock. All
 methods of IR will be usable including reopen and clone. 
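 A sketch of the intended usage (getReader is the method this issue 
 proposes; the rest is existing Lucene API):
 {code}
 IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
 writer.updateDocument(new Term("id", "42"), newDoc); // buffered delete + add, no commit needed
 IndexReader reader = writer.getReader(); // sees the update without closing the writer
 IndexSearcher searcher = new IndexSearcher(reader);
 {code}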

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693568#action_12693568
 ] 

Shai Erera commented on LUCENE-1581:


bq. What I'd like to see is that lucene has a pluggable way to handle ICU, in 
so far as it does Locale specific things such as this. Such as using a base 
class UpperCaseFolder that provides the Java implementation, but that can take 
an alternate implementation, perhaps by reflection.

Why do this? What prevents you in your application from creating such a filter? 
Lucene does not provide many analyzers, nor a single Analyzer for use by all 
with configurable options. So why provide in core Lucene a filter which uses 
ICU4J? I'm asking about core Lucene; of course such a module can sit in 
contrib, as do the other analyzers for other languages ...

BTW, I've had some experience with ICU4J and it had several performance issues, 
such as large consecutive array allocations. It also operates on Strings, and 
does not have the efficient API Lucene uses in tokenization (i.e., working on 
char[]).
However, I worked with it a long time ago, and perhaps things have changed 
since.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Possible IndexInput optimization

2009-03-29 Thread Paul Elschot
On Sunday 29 March 2009 13:47:59 Earwin Burrfoot wrote:
  Earwin,
  I did not experiment lately, but I'd like to add a general compressed
  integer array to the basic types in an index, that would be compressed
  on writing and decompressed on reading.
  A first attempt is at LUCENE-1410, and one of the choices I had there
  was whether or not to use NIO buffer methods on the index side.
  I started there using these NIO buffer methods, but it seems that
  the explicit byte arrays you're using here could be a good alternative.
  I think my question boils down to whether or not these NIO buffers will
  (in the end) get in the way of similar low level optimizations
  you'd like to see applied here.
  Regards,
 
  Paul Elschot
 In my case I have to switch to MMap/Buffers, Java behaves ugly with
 8Gb heaps. 

Do you mean that because garbage collection does not perform well
on these larger heaps, one should avoid creating arrays that need heaps
of that size, and rather use (direct) MMap/Buffers?

 I'm thinking of trying to use Short/Int/LongBuffers that
 wrap my initial ByteBuffer.

So far I have used an IntBuffer wrapping a ByteBuffer at LUCENE-1410.
In case it is better not to create arrays for data read from the index,
I'll keep it that way, hoping that it doesn't run into backward
compatibility problems.
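
For reference, the wrapping in question is just a view over the same bytes,
e.g. (sizes here illustrative only):

ByteBuffer bytes = ByteBuffer.allocateDirect(1 << 20);
IntBuffer ints = bytes.asIntBuffer(); // no copy; each get() decodes 4 bytes, big-endian
int first = ints.get(0);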

Regards,
Paul Elschot

 
 -- 
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 


Re: LockObtainFailedException exception

2009-03-29 Thread Ketan Deshpande
Hi Mike,
 
  Thanks for the response. I did a code check, but this was a random error, 
which pointed towards something to do with the environment. Finally, I did 
figure out the problem - low disk space. Though there was around 1 GB of free 
space on the server, it was not sufficient when we had to merge a large number 
of indexes. Anyway, we have now done the needful and the problem hasn't 
recurred!
 
Cheers,
Ketan

--- On Mon, 2/3/09, Michael McCandless luc...@mikemccandless.com wrote:


From: Michael McCandless luc...@mikemccandless.com
Subject: Re: LockObtainFailedException exception
To: java-dev@lucene.apache.org
Date: Monday, 2 March, 2009, 10:24 PM



Is it possible you accidentally allow two writers to try to open the index?

That would explain this failure; the 2nd writer would fail to acquire the lock, 
because the first writer has the index open.

Or, is it possible you're not closing a previously opened writer?

Mike

Ketan Deshpande wrote:

 Hi,
 
   I am fairly new to Lucene, so forgive my elaborate explanation. We were 
facing frequent issues with Lucene 1.2 (unreleased write.lock files). To 
overcome these, we recently upgraded to Lucene 2.3.2 - however, we observed the 
following LockObtainFailedException during our testing:
 
 2009-02-26 15:34:35,525 DEBUG 
 [com.eu.prnewswire.search.document.WDPIndexDocument] Document() called
 2009-02-26 15:34:35,529 DEBUG 
 [com.eu.prnewswire.search.document.WDPIndexDocument] adding associated type
 2009-02-26 15:34:35,529 DEBUG 
 [com.eu.prnewswire.search.document.WDPIndexDocument] added
 2009-02-26 15:34:36,535 ERROR [STDERR] 
 org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
 SimpleFSLock@/jboss/jboss-4.0.5.GA/spool/lucene/search1/index/PRNJ_2009_02/write.lock
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 org.apache.lucene.store.Lock.obtain(Lock.java:85)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 org.apache.lucene.index.IndexWriter.init(IndexWriter.java:692)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 org.apache.lucene.index.IndexWriter.init(IndexWriter.java:503)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 com.eu.prnewswire.search.index.LuceneIndex.addDocument(LuceneIndex.java:124)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at
 com.eu.prnewswire.search.indexer.prnjindexer.PRNJIndexerEJB.addToLuceneIndex(PRNJIndexerEJB.java:193)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at
 com.eu.prnewswire.search.indexer.prnjindexer.PRNJIndexerEJB.indexDocument(PRNJIndexerEJB.java:121)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 java.lang.reflect.Method.invoke(Method.java:324)
 2009-02-26 15:34:36,536 ERROR [STDERR]  at 
 org.jboss.invocation.Invocation.performCall(Invocation.java:359)
 
   From the stack trace, we can trace back the exception to the following code 
in the IndexWriter class (while trying to acquire a lock):
 
 Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
 if (!writeLock.obtain(writeLockTimeout)) // obtain write lock
     throw new LockObtainFailedException("Index locked for write: " + writeLock);
 
   We have seen this issue only once till now, and the files did not index 
until we deleted the lock file manually. (When I checked for existing issues, 
LUCENE-715 came closest, but it was resolved in the 2.1 version.) I am afraid 
this may crop up sometime again. Any inputs on how to resolve the error would 
be appreciated. If any more details are required, I would be happy to share 
them.
 
 Thanks,
 Ketan
 


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org





[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693579#action_12693579
 ] 

DM Smith commented on LUCENE-1581:
--

bq. Why do this?
Lucene has a bias toward English texts and does not have a fundamental 
architecture focused on internationalization and localization. IMHO, it should.

Java does not implement Unicode well and does not keep abreast of its 
changes. It's not that ICU is the right solution. It is *a* robust solution.

bq. What prevents you in your application from creating such a filter?
Nothing at all. But I think that proper behavior regarding Unicode and locales 
is something that many want, especially for non-English indexes. As such it 
belongs with Lucene, not individual projects.

With that in mind, I think it would be great if Lucene were fully 
internationalized and localized, at least from a fundamental architecture 
perspective. (There is a separate issue on what core and contrib should be. I'm 
not clear where analyzers fall wrt that.)

As an implementation: if ICU is present it is used, with potential performance 
impacts; if not, behavior degrades predictably and gracefully. This would create 
a quasi-dependency, not a hard one.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: LockObtainFailedException exception

2009-03-29 Thread Michael McCandless
Super, thanks for bringing closure!

Mike

On Sun, Mar 29, 2009 at 11:58 AM, Ketan Deshpande
ketandes...@yahoo.co.in wrote:
 Hi Mike,

   Thanks for the response. I did a code check, but this was a random error,
 which pointed towards something to do with the environment. Finally, I did
 figure out the problem - low disk space. Though there was around 1 GB of
 free space on the server, it was not sufficient when we had to merge a large
 number of indexes. Anyway, we have now done the needful and the problem
 hasn't recurred!

 Cheers,
 Ketan


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693591#action_12693591
 ] 

Robert Muir commented on LUCENE-1581:
-

Some comments I have on this topic.

The problems I have with default internationalization support in Lucene revolve 
around the following:

1. Breaking text into words (parsing) is not Unicode-sensitive, 
i.e. if I have a word containing s + macron (s̄) it will not be tokenized 
correctly.

2. Various filters, like lowercase as mentioned here, but also accent removal, 
are not Unicode-sensitive, 
i.e. if I have s + macron (s̄) the macron will not be removed.
This is not a normalization problem, but it's true it also doesn't seem to work 
correctly on decomposed NF(K)D text, for similar reasons. In this example, there 
is no composed form for s + macron available in Unicode, so I cannot 'hack' 
around the problem by running NFC on this text before I feed it to Lucene.

3. Unicode text must be normalized so that both queries and text are in a 
consistent representation.

One option I see is to have at least a basic analyzer that uses ICU to do the 
following (a rough sketch of the filter side follows this list):
1. Break text into words correctly.
2. Common filters to do things like lowercase and accent removal correctly.
3. A filter to normalize text to one Unicode normal form (say, NFKC by 
default).
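
As a very rough sketch of what steps 2 and 3 could look like with ICU4J 
(illustrative names; old-style TokenStream API; UCharacter, Normalizer and 
ULocale are com.ibm.icu classes):

{code}
public final class ICUFoldingFilter extends TokenFilter {
  private final ULocale locale;

  public ICUFoldingFilter(TokenStream input, ULocale locale) {
    super(input);
    this.locale = locale;
  }

  public Token next(Token result) throws IOException {
    result = input.next(result);
    if (result == null) return null;
    String term = new String(result.termBuffer(), 0, result.termLength());
    // step 3: normalize to NFKC, then step 2: locale-aware lowercasing via ICU
    String folded = UCharacter.toLowerCase(locale, Normalizer.normalize(term, Normalizer.NFKC));
    result.setTermBuffer(folded.toCharArray(), 0, folded.length());
    return result;
  }
}
{code}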

In my opinion, having this available would solve a majority of the current 
problems.

I kinda started trying to implement some of this with LUCENE-1488... (at least 
it does step 1!)



 LowerCaseFilter should be able to be configured to use a specific locale.
 -

 Key: LUCENE-1581
 URL: https://issues.apache.org/jira/browse/LUCENE-1581
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Digy

 //Since I am a .Net programmer, Sample codes will be in c# but I don't think 
 that it would be a problem to understand them.
 //
 Assume an input text like İ and and analyzer like below
 {code}
   public class SomeAnalyzer : Analyzer
   {
   public override TokenStream TokenStream(string fieldName, 
 System.IO.TextReader reader)
   {
   TokenStream t = new SomeTokenizer(reader);
   t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
   t = new LowerCaseFilter(t);
   return t;
   }
 
   }
 {code}
   
 ASCIIFoldingFilter will return I and after, LowerCaseFilter will return
   i (if locale is en-US) 
   or 
   ı' if(locale is tr-TR) (that means,this token should be input to 
 another instance of ASCIIFoldingFilter)
 So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
 but a better approach can be adding
 a new constructor to LowerCaseFilter and forcing it to use a specific locale.
 {code}
 public sealed class LowerCaseFilter : TokenFilter
 {
 /* +++ */System.Globalization.CultureInfo CultureInfo = 
 System.Globalization.CultureInfo.CurrentCulture;
 public LowerCaseFilter(TokenStream in) : base(in)
 {
 }
 /* +++ */  public LowerCaseFilter(TokenStream in, 
 System.Globalization.CultureInfo CultureInfo) : base(in)
 /* +++ */  {
 /* +++ */  this.CultureInfo = CultureInfo;
 /* +++ */  }
   
 public override Token Next(Token result)
 {
 result = Input.Next(result);
 if (result != null)
 {
 char[] buffer = result.TermBuffer();
 int length = result.termLength;
 for (int i = 0; i  length; i++)
 /* +++ */ buffer[i] = 
 System.Char.ToLower(buffer[i],CultureInfo);
 return result;
 }
 else
 return null;
 }
 }
 {code}
 DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries

2009-03-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1579:
---

Attachment: LUCENE-1579.patch

New patch, even simpler.

 Cloned SegmentReaders fail to share FieldCache entries
 --

 Key: LUCENE-1579
 URL: https://issues.apache.org/jira/browse/LUCENE-1579
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1579.patch, LUCENE-1579.patch, LUCENE-1579.patch


 I just hit this on LUCENE-1516, which returns cloned readOnly
 readers from IndexWriter.
 The problem is, when cloning, we create a new [thin] cloned
 SegmentReader for each segment.  FieldCache keys directly off this
 object, so if you clone the reader and do a search that requires the
 FieldCache (eg, sorting) then that first search is always very slow
 because every single segment is reloading the FieldCache.
 This is of course a complete showstopper for LUCENE-1516.
 With LUCENE-831 we'll switch to a new FieldCache API; we should ensure
 this bug is not present there.  We should also fix the bug in the
 current FieldCache API since for 2.9, users may hit this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Possible IndexInput optimization

2009-03-29 Thread Earwin Burrfoot
 In my case I have to switch to MMap/Buffers, Java behaves ugly with
 8Gb heaps.
 Do you mean that because garbage collection does not perform well
 on these larger heaps, one should avoid to create arrays to have heaps
 of that size, and rather use (direct) MMap/Buffers?
Yes, exactly. Keeping big Directories in heap is painful in many ways:
1. Old-gen GC is slow on big heaps. Our 3Gb heaps were collected for
6-8 seconds with the parallel collector on four-way machines. The concurrent
collector consistently core dumps, whatever the settings :) Then we
tried increasing heaps (up to 8Gb) in pursuit of fewer machines in the
cluster, and it just collected for an eternity.
2. The eden-survivor-old chain is showering sparks around when you feed it
huge arrays created in numbers. So your new-gen GCs are still
swift (100-200ms), but happen too often. As a consequence some
short-lived objects start leaking into old-gen.
3. You have to reserve space for merges. Fully optimizing an index is
very taxing; I cheat by stopping accepting outside requests, switching
off the memory cache, optimizing, then putting everything back in place.

I'm currently testing mmap approach, and despite Sun's braindead API,
it works like a charm.

While I'm at it, I have two more questions about MMapDirectory.
How often is openInput() called for a file? Is it worth doing
getChannel().map() when the file is written and closed, and then cloning the
buffer for each openInput()?
Why don't you force() a newly-mapped Buffer? It would save the first few
searches hitting a new segment from page faults while waiting for that
segment to be loaded.
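
For what it's worth, the map-once-then-clone idea would look roughly like
this (note that load(), rather than force(), is the call that pre-faults
the pages of a read-only mapping):

MappedByteBuffer master = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
master.load();                        // touch all pages up front
ByteBuffer view = master.duplicate(); // per-openInput clone: shared pages, private position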


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store

2009-03-29 Thread Vaijanath N. Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693643#action_12693643
 ] 

Vaijanath N. Rao commented on LUCENE-1039:
--

Hi Karl,

Can you tell me how to use this with FSDirectory() rather then RAMDirectory(). 
I am getting following error

Exception in thread "main" java.lang.NullPointerException
at 
org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.doc(MultiSegmentReader.java:552)
at 
org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:94)
at 
org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:139)
at 
org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
at 
org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:71)
at 
org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:72)
at 
org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:64)

When trying to use FSDirectory, I created the index as per the test sample and 
closed it. Now, while doing a classification, I am getting the above error.

The way I create the directory is:

FSDirectory dir = FSDirectory.getDirectory(new File(indexPath));
IndexWriter iw = new IndexWriter(dir, instanceFactory.getAnalyzer(), create, MaxFieldLength.LIMITED);
iw.close();

The code for adding the instance is:
instances.addInstance(record.getText(), record.getClass());

instance.flush() and instance.close() all go fine.

While doing classification I again open the directory (with create set to 
false) and the rest of the calls remain the same.

Instances instances = new Instances(dir, indexCreator.instanceFactory, "class");
classifier = new NaiveBayesClassifier();
return classifier.classify(instances, text)[0].getClassification();

Can you help me in pointing out where I am going wrong?

--Thanks and Regards
Vaijanath N. Rao




 Bayesian classifiers using Lucene as data store
 ---

 Key: LUCENE-1039
 URL: https://issues.apache.org/jira/browse/LUCENE-1039
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
 Attachments: LUCENE-1039.txt


 Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and 
 Fisher method algorithms as described by Toby Segaran in "Programming 
 Collective Intelligence", ISBN 978-0-596-52932-1. 
 Have fun.
 Poor javadocs, but the TestCase shows how to use it:
 {code:java}
 public class TestClassifier extends TestCase {
   public void test() throws Exception {
     InstanceFactory instanceFactory = new InstanceFactory() {
       public Document factory(String text, String _class) {
         Document doc = new Document();
         doc.add(new Field("class", _class, Field.Store.YES, Field.Index.NO_NORMS));
         doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
         doc.add(new Field("text/ngrams/start", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
         doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
         doc.add(new Field("text/ngrams/end", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
         return doc;
       }

       Analyzer analyzer = new Analyzer() {
         private int minGram = 2;
         private int maxGram = 3;

         public TokenStream tokenStream(String fieldName, Reader reader) {
           TokenStream ts = new StandardTokenizer(reader);
           ts = new LowerCaseFilter(ts);
           if (fieldName.endsWith("/ngrams/start")) {
             ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
           } else if (fieldName.endsWith("/ngrams/inner")) {
             ts = new NGramTokenFilter(ts, minGram, maxGram);
           } else if (fieldName.endsWith("/ngrams/end")) {
             ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, minGram, maxGram);
           }
           return ts;
         }
       };

       public Analyzer getAnalyzer() {
         return analyzer;
       }
     };

     Directory dir = new RAMDirectory();
     new IndexWriter(dir, null, true).close();
     Instances instances = new Instances(dir, instanceFactory, "class");
     instances.addInstance("hello world", "en");
     instances.addInstance("hallå världen", "sv");
     instances.addInstance("this is london calling", "en");
     instances.addInstance("detta är london som ringer", "sv");
     instances.addInstance("john has a long mustache", "en");