Re: KeywordTokenizer isn't reusable

2007-12-17 Thread Michael McCandless


Yes please do!  Thanks.

Mike

TAKAHASHI hideaki wrote:


Hi, all

I found that KeywordAnalyzer/KeywordTokenizer on trunk has a problem.

They keep state (tokenStreams in Analyzer and done in KeywordTokenizer),
but they never reset that state, so KeywordAnalyzer can't analyze
a field a second time.
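
A minimal sketch of the symptom (hypothetical test code, assuming the trunk
APIs of the time): indexing two documents through the same KeywordAnalyzer
leaves the second field without a token, because the cached KeywordTokenizer's
done flag is never cleared.

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class KeywordReuseSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new KeywordAnalyzer(), true);

    // First document tokenizes fine.
    Document d1 = new Document();
    d1.add(new Field("partnum", "Q36", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(d1);

    // Second document reuses the cached KeywordTokenizer whose 'done' flag
    // is still true, so no token is produced for "Q37".
    Document d2 = new Document();
    d2.add(new Field("partnum", "Q37", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(d2);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    System.out.println("Q36 indexed: "
        + reader.termDocs(new Term("partnum", "Q36")).next());
    System.out.println("Q37 indexed: "
        + reader.termDocs(new Term("partnum", "Q37")).next());  // false before the fix
    reader.close();
  }
}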

I already created a patch for this problem.
Can I send this patch?

Thanks,
Hideaki




[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly

2007-12-17 Thread Terry Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552434
 ] 

Terry Yang commented on LUCENE-588:
---

I wrote my first patch for this issue. If QueryParser knows the query is a 
wildcard query, it directly passes the original query string to WildcardQuery, 
which knows exactly which characters are wildcards and which are not. I copied 
part of the discardEscapeChar method from QueryParser, because discardEscapeChar 
throws ParseException and propagating that would require changing WildcardQuery 
quite a bit. I am looking for help/ideas about a better way to handle this exception.
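
To sketch the idea (illustrative only, this is not the attached patch): once
the still-escaped term reaches WildcardQuery, it can scan the string itself,
treating a backslash-escaped character as a literal and only unescaped ? and *
as wildcards.

/** Sketch only: hypothetical helper, not part of the attached LUCENE-588.patch. */
public class WildcardEscapeSketch {

  /**
   * Scans an escaped wildcard term and records which output characters are
   * real wildcards. A backslash escapes the following character, so "t\??t"
   * yields the literal text "t??t" with only the second '?' flagged.
   */
  static boolean[] wildcardPositions(String term, StringBuffer literalOut) {
    boolean[] flags = new boolean[term.length()];  // upper bound on output length
    int out = 0;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      if (c == '\\' && i + 1 < term.length()) {
        literalOut.append(term.charAt(++i));  // escaped char is always a literal
        flags[out++] = false;
      } else {
        literalOut.append(c);
        flags[out++] = (c == '?' || c == '*');
      }
    }
    boolean[] result = new boolean[out];  // trim to the unescaped length
    System.arraycopy(flags, 0, result, 0, out);
    return result;
  }
}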

 Escaped wildcard character in wildcard term not handled correctly
 -

 Key: LUCENE-588
 URL: https://issues.apache.org/jira/browse/LUCENE-588
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.0.0
 Environment: Windows XP SP2
Reporter: Sunil Kamath

 If an escaped wildcard character is specified in a wildcard query, it is 
 treated as a wildcard instead of a literal.
 e.g., t\??t is converted by the QueryParser to t??t - the escape character is 
 discarded.




[jira] Updated: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly

2007-12-17 Thread Terry Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Yang updated LUCENE-588:
--

Attachment: LUCENE-588.patch

 Escaped wildcard character in wildcard term not handled correctly
 -

 Key: LUCENE-588
 URL: https://issues.apache.org/jira/browse/LUCENE-588
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.0.0
 Environment: Windows XP SP2
Reporter: Sunil Kamath
 Attachments: LUCENE-588.patch


 If an escaped wildcard character is specified in a wildcard query, it is 
 treated as a wildcard instead of a literal.
 e.g., t\??t is converted by the QueryParser to t??t - the escape character is 
 discarded.




Re: Background Merges

2007-12-17 Thread Michael McCandless


Not good!

It's almost certainly a bug in Lucene, I think, because Solr is  
just a consumer of Lucene's API, and that shouldn't ever cause something  
like this.


Apparently, while merging stored fields, SegmentMerger tried to read  
too far.


Is this easily repeatable?

Mike

Grant Ingersoll wrote:

I am running Lucene trunk with Solr and am getting the exception  
below when I call Solr's optimize.  I will see if I can isolate it  
to a test case, but thought I would throw it out there in case anyone  
sees anything obvious.


In this case, I am adding documents sequentially and then at the  
end call Solr's optimize, which invokes Lucene's optimize.  The  
problem could be in Solr, in that its notion of commit does not  
play nice with Lucene's new merge policy.  However, I am posting  
here because the signs point to an issue in Lucene.
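
For what it's worth, a rough sketch of the Lucene-only test case this could be
reduced to (hypothetical paths and document counts, not Solr's actual code
path): add documents sequentially, then call optimize(), which is essentially
what Solr's commit/optimize ends up invoking.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class OptimizeRepro {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/tmp/optimize-repro"),
        new StandardAnalyzer(), true);
    // Add documents sequentially, as Solr does before the commit.
    for (int i = 0; i < 1003; i++) {  // enough docs to produce several segments
      Document doc = new Document();
      doc.add(new Field("body", "doc " + i, Field.Store.YES, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }
    // Solr's optimize boils down to this call; the background merge
    // exception in the log below surfaced from the equivalent step.
    writer.optimize();
    writer.close();
  }
}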


Cheers,
Grant


Exception in thread "Thread-20" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:280)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:167)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:300)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3050)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2792)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

Dec 17, 2007 1:44:26 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _3:C500 _4:C3 _l:C500 into _m [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1744)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1684)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1664)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:544)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
    at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:102)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:113)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:121)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:875)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:283)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:234)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    ...
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:280)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:167)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:300)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3050)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2792)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

Dec 17, 2007 1:44:26 PM org.apache.solr.core.SolrCore execute
INFO: [null] /update optimize=true&wt=xml&waitFlush=true&waitSearcher=true&version=2.2 0 1626

Dec 17, 2007 1:44:26 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _3:C500 _4:C3 _l:C500 into _m [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1744)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1684)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1664)
    at 

Re: [Lucene-java Wiki] Update of PoweredBy by PietSchmidt

2007-12-17 Thread Daniel Naber
On Monday, 17 December 2007, Apache Wiki wrote:

 +  * [http://frauen-kennenlernen.com/ Frauen kennenlernen] - Search
 engine using Lucene

I don't claim that this is spam, but more and more of the Wiki PoweredBy 
links look like someone just wants a link from the Lucene project, 
probably to boost their Google ranking. We cannot tell whether these 
people really use Lucene at all, or whether they use some blogging software 
which in turn uses Lucene (in that case it wouldn't make sense to link to 
them from our page either).

My suggestion would be that we only accept links if people use Lucene 
directly (not via some software that happens to have a Lucene-based search) 
and if they put a link to Lucene on their imprint/contact page or on the 
search result page. On the other hand, while the page above is harmless, I 
guess it's not necessarily something Apache Lucene needs to be associated 
with.

Any suggestions?

Regards
 Daniel

-- 
http://www.danielnaber.de




[jira] Created: (LUCENE-1091) Big IndexWriter memory leak: when Field.Index.TOKENIZED

2007-12-17 Thread Mirza Hadzic (JIRA)
Big IndexWriter memory leak: when Field.Index.TOKENIZED
---

 Key: LUCENE-1091
 URL: https://issues.apache.org/jira/browse/LUCENE-1091
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.2
 Environment: Ubuntu Linux 7.10, 32-bit
Java 1.6.0 build 1.6.0_03-b05 (default in Ubuntu 7.10)
1GB RAM
Reporter: Mirza Hadzic


This little program eats an additional ~2MB of virtual RAM for every 1000 
documents indexed, but only when Field.Index.TOKENIZED is used:

public Document getDoc() {
   Document document = new Document();
   document.add(new Field("foo", "foo bar", Field.Store.NO,
       Field.Index.TOKENIZED));
   return document;
}

public Document run() {
   IndexWriter writer = new IndexWriter(new File(jIndexFileName),
       new StandardAnalyzer(), true);
   for (int i = 0; i < 100; i++) {
      writer.addDocument(getDoc());
   }
}
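
A self-contained variant one might run to look for the reported growth
(hypothetical class and path names; it adds the surrounding class, a driver
loop, a writer.close(), and a used-memory printout that the fragment above
leaves out):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TokenizedLeakCheck {
  static Document getDoc() {
    Document document = new Document();
    document.add(new Field("foo", "foo bar", Field.Store.NO, Field.Index.TOKENIZED));
    return document;
  }

  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(new File("/tmp/leakcheck"),
        new StandardAnalyzer(), true);
    for (int i = 0; i < 100000; i++) {
      writer.addDocument(getDoc());
      if (i % 1000 == 0) {
        // Print used heap every 1000 docs to watch for steady growth.
        Runtime rt = Runtime.getRuntime();
        System.out.println(i + " docs, used heap: "
            + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " MB");
      }
    }
    writer.close();
  }
}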





Re: O/S Search Comparisons

2007-12-17 Thread Grant Ingersoll
I did hear back from the authors.  Some of the issues were based on the  
value chosen for mergeFactor (10,000, I think), but there also seemed  
to be some questions about parsing the TREC collection.  It was split  
out into individual files, as opposed to streaming in the documents  
like we do with Wikipedia, so I/O overhead may be an issue.  At the  
time, 1.9.1 did not have much TREC support, so splitting files is  
probably the easiest way to do it.  Their indexing code was based off  
the demo and some LIA reading.


They thought they would try Lucene again when 2.3 comes out.  From our  
end, I think we need to improve the docs around mergeFactor.  We  
generally just say bigger is better, but my understanding is that there  
is definitely a limit to this (100?  Maybe 1000?), so we should  
probably suggest that in the docs.  And, of course, the new contrib/ 
benchmark has support for reading TREC (although I don't know if it  
handles streaming it), so I think it shouldn't be a problem this  
time around.
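
For context, mergeFactor is just an IndexWriter knob; a rough sketch of the
sort of guidance the docs could give (illustrative values and paths, not a
tested recommendation):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class MergeFactorExample {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
        new StandardAnalyzer(), true);
    // Very large values (e.g. 10,000) defer merging almost entirely and tend
    // to hurt rather than help; tens to low hundreds is the usual range.
    writer.setMergeFactor(50);        // segments merged at a time
    writer.setMaxBufferedDocs(1000);  // flush a new segment every 1000 docs
    // ... add documents ...
    writer.close();
  }
}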


At any rate, I think we are for the most part doing the right things.   
Does anyone have any thoughts or advice about an upper bound for mergeFactor?


Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:


On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.


Solr has well-tested code that provides all this functionality and  
more (except for automatically spawning extra indexing threads,  
which I agree would be a useful addition).  It does heavily depend  
on 1.5's java.util.concurrent package, though.  Many people seem to  
like using Solr as an embedded library layer on top of Lucene to do  
it all in-process, as well.


-Mike




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: O/S Search Comparisons

2007-12-17 Thread Mark Miller
For the data that I normally work with (short articles), I found that 
the sweet spot was around 80-120. I actually saw a slight decrease going 
above that... not sure if that held forever, though. That was testing on 
an earlier release (I think 2.1?). However, if you want to test 
searching, it would seem that you are going to want to optimize the 
index. I have always found that whatever I save by changing the merge 
factor is paid back when I optimize. I have not scientifically 
tested this, but found it to be the case in every speed test I ran. This 
is an interesting thing to me for this test: do you test with a full 
optimize for indexing? If you don't, can you really test the search 
performance with the advantage of a full optimize? So, if you are going 
to optimize, why mess with the merge factor? It may still play a small 
role, but at best I think it's a pretty weak lever.


- Mark

Grant Ingersoll wrote:
I did hear back from the authors.  Some of the issues were based on 
values chosen for mergeFactor (10,000) I think, but there also seemed 
to be some questions about parsing the TREC collection.  It was split 
out into individual files, as opposed to trying to stream in the 
documents like we do with Wikipedia, so I/O overhead may be an issue.  
At the time, 1.9.1 did not have much TREC support, so splitting files 
is probably the easiest way to do it.  There indexing code was based 
off the demo and some LIA reading.


They thought they would try Lucene again when 2.3 comes out.  From our 
end, I think we need to improve the docs around mergeFactor.  We 
generally just say bigger is better, but my understanding is there is 
definitely a limit to this (100??  Maybe 1000) so we should probably 
suggest that in the docs.  And, of course, I think the new 
contrib/benchmark has support for reading TREC (although I don't know 
if it handles streaming it) such that I think it shouldn't be a 
problem this time around.


At any rate, I think we are for the most part doing the right things.  
Anyone have any thoughts on advice about an upper bound for mergeFactor?


Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:


On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing  warming.


Solr has well-tested code that provides all this functionality and 
more (except for automatically spawning extra indexing threads, which 
I agree would be a useful addition).  It does heavily depend on 1.5's 
java.util.concurrent package, though.  Many people seem like using 
Solr as an embedded library layer on top of Lucene to do it all 
in-process, as well.


-Mike




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: KeywordTokenizer isn't reusable

2007-12-17 Thread TAKAHASHI hideaki
Hi,
Here is the patch for KeywordAnalyzer, KeywordTokenizer, TestKeywordAnalyzer.

Thanks,
Hideaki

On Dec 17, 2007 6:49 PM, Michael McCandless [EMAIL PROTECTED] wrote:

 Yes please do!  Thanks.

 Mike


 TAKAHASHI hideaki wrote:

  Hi, all
 
  I found KeywordAnalyzer/KeywordTokenizer on trunk has a problem.
 
  These have a condition(tokenStreams in Analyzer and done in
  KeywordTokenizer),
  but these don't reset the condition. So KeywordAnalyzer can't analyze
  a field more then twice.
 
  I already created a patch for this problem.
  Can I send this patch?
 
  Thanks,
  Hideaki
 





-- 
高橋 秀明
Index: src/test/org/apache/lucene/analysis/TestKeywordAnalyzer.java
===
--- src/test/org/apache/lucene/analysis/TestKeywordAnalyzer.java	(revision 605078)
+++ src/test/org/apache/lucene/analysis/TestKeywordAnalyzer.java	(working copy)
@@ -18,7 +18,10 @@
  */
 
 import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermDocs;
 import org.apache.lucene.store.RAMDirectory;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
@@ -61,4 +64,22 @@
                  "+partnum:Q36 +space", query.toString("description"));
     assertEquals("doc found!", 1, hits.length());
   }
+
+  public void testMutipleDocument() throws Exception {
+    RAMDirectory dir = new RAMDirectory();
+    IndexWriter writer = new IndexWriter(dir, new KeywordAnalyzer(), true);
+    Document doc = new Document();
+    doc.add(new Field("partnum", "Q36", Field.Store.YES, Field.Index.TOKENIZED));
+    writer.addDocument(doc);
+    doc = new Document();
+    doc.add(new Field("partnum", "Q37", Field.Store.YES, Field.Index.TOKENIZED));
+    writer.addDocument(doc);
+    writer.close();
+
+    IndexReader reader = IndexReader.open(dir);
+    TermDocs td = reader.termDocs(new Term("partnum", "Q36"));
+    assertTrue(td.next());
+    td = reader.termDocs(new Term("partnum", "Q37"));
+    assertTrue(td.next());
+  }
 }
Index: src/java/org/apache/lucene/analysis/KeywordTokenizer.java
===
--- src/java/org/apache/lucene/analysis/KeywordTokenizer.java	(revision 605078)
+++ src/java/org/apache/lucene/analysis/KeywordTokenizer.java	(working copy)
@@ -55,4 +55,9 @@
     }
     return null;
   }
+
+  public void reset(Reader input) throws IOException {
+    super.reset(input);
+    this.done = false;
+  }
 }
Index: src/java/org/apache/lucene/analysis/KeywordAnalyzer.java
===
--- src/java/org/apache/lucene/analysis/KeywordAnalyzer.java	(revision 605078)
+++ src/java/org/apache/lucene/analysis/KeywordAnalyzer.java	(working copy)
@@ -17,6 +17,7 @@
  * limitations under the License.
  */
 
+import java.io.IOException;
 import java.io.Reader;
 
 /**
@@ -29,12 +30,13 @@
     return new KeywordTokenizer(reader);
   }
   public TokenStream reusableTokenStream(String fieldName,
-                                         final Reader reader) {
+                                         final Reader reader) throws IOException {
     Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
     if (tokenizer == null) {
       tokenizer = new KeywordTokenizer(reader);
       setPreviousTokenStream(tokenizer);
-    }
+    } else
+      tokenizer.reset(reader);
     return tokenizer;
   }
 }

Re: O/S Search Comparisons

2007-12-17 Thread Doron Cohen
On Dec 18, 2007 2:38 AM, Mark Miller [EMAIL PROTECTED] wrote:

 For the data that I normally work with (short articles), I found that
 the sweet spot was around 80-120. I actually saw a slight decrease going
 above that...not sure if that held forever though. That was testing on
 an earlier release  (I think 2.1?). However, if you want to test
 searching it would seem that you are going to want to optimize the
 index. I have always found that whatever I save by changing the merge
 factor is paid back when you optimize. I have not scientifically
 tested this, but found it to be the case in every speed test I ran. This
 is an interesting thing to me for this test. Do you test with a full
 optimize for indexing? If you don't, can you really test the search
 performance with the advantage of a full optimize? So, if you are going
 to optimize, why mess with the merge factor? It may still play a small
 role, but at best I think its a pretty weak lever.


I had a similar experience - I set the merge factor to ~MAX_INT and optimized
at the end, and it felt like it was the same (never measured it, though).
In fact, with the new concurrent merges, I think it should be faster to
merge on the fly?

(One comment - it is important to set the merge factor back to a reasonable
number before the final optimize; otherwise you hit an OutOfMemoryError because
so many segments are being merged at once.)
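
A rough sketch of that pattern (illustrative values only, not measured advice):
defer merging during the bulk load, then drop the merge factor back before the
final optimize so it doesn't try to merge thousands of segments in one pass.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class BulkLoadThenOptimize {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/bulk"),
        new StandardAnalyzer(), true);

    writer.setMergeFactor(Integer.MAX_VALUE);  // effectively no merging while loading
    for (int i = 0; i < 100000; i++) {
      Document doc = new Document();          // stand-in for real documents
      doc.add(new Field("id", Integer.toString(i), Field.Store.YES,
          Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
    }

    // Set the merge factor back to something sane before optimizing, otherwise
    // the optimize merges a huge number of segments at once and can run out of memory.
    writer.setMergeFactor(10);
    writer.optimize();
    writer.close();
  }
}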


 - Mark

 Grant Ingersoll wrote:
  I did hear back from the authors.  Some of the issues were based on
  values chosen for mergeFactor (10,000) I think, but there also seemed
  to be some questions about parsing the TREC collection.  It was split
  out into individual files, as opposed to trying to stream in the
  documents like we do with Wikipedia, so I/O overhead may be an issue.
  At the time, 1.9.1 did not have much TREC support, so splitting files
  is probably the easiest way to do it.  There indexing code was based
  off the demo and some LIA reading.
 
  They thought they would try Lucene again when 2.3 comes out.  From our
  end, I think we need to improve the docs around mergeFactor.  We
  generally just say bigger is better, but my understanding is there is
  definitely a limit to this (100??  Maybe 1000) so we should probably
  suggest that in the docs.  And, of course, I think the new
  contrib/benchmark has support for reading TREC (although I don't know
  if it handles streaming it) such that I think it shouldn't be a
  problem this time around.


Yes, it does streaming - TREC compressed files are read with a GZIPInputStream
on demand: the next doc's text is read/parsed only when the indexer requests it,
and the indexable document is created in memory; no doc files are created on disk.
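
A generic sketch of that kind of on-demand reading (illustrative only, not the
contrib/benchmark code itself): wrap the compressed TREC file in a
GZIPInputStream and hand out one document's worth of text per request, instead
of unpacking everything to disk first.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

/** Sketch only: streams <DOC>...</DOC> blocks out of a gzipped TREC file on demand. */
public class TrecDocStream {
  private final BufferedReader in;

  public TrecDocStream(String gzPath) throws IOException {
    in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(gzPath))));
  }

  /** Returns the text of the next document, or null at end of file. */
  public String nextDoc() throws IOException {
    StringBuffer doc = new StringBuffer();
    String line;
    boolean inDoc = false;
    while ((line = in.readLine()) != null) {
      if (line.startsWith("<DOC>")) inDoc = true;
      if (inDoc) doc.append(line).append('\n');
      if (line.startsWith("</DOC>")) return doc.toString();
    }
    return null;  // no more documents
  }
}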


 
  At any rate, I think we are for the most part doing the right things.
  Anyone have any thoughts on advice about an upper bound for mergeFactor?
 
  Cheers,
  Grant
 
 
  On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:
 
  On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
 
  +1  I have been thinking about this too.  Solr clearly demonstrates
  the benefits of this kind of approach, although even it doesn't make
  it seamless for users in the sense that they still need to divvy up
  the docs on the app side.
 
  Would be nice if this layer also took care of searchers/readers
  refreshing  warming.
 
  Solr has well-tested code that provides all this functionality and
  more (except for automatically spawning extra indexing threads, which
  I agree would be a useful addition).  It does heavily depend on 1.5's
  java.util.concurrent package, though.  Many people seem like using
  Solr as an embedded library layer on top of Lucene to do it all
  in-process, as well.
 
  -Mike
 
 
 
  --
  Grant Ingersoll
  http://lucene.grantingersoll.com
 
  Lucene Helpful Hints:
  http://wiki.apache.org/lucene-java/BasicsOfPerformance
  http://wiki.apache.org/lucene-java/LuceneFAQ
 
 
 
 




[jira] Updated: (LUCENE-1091) Big IndexWriter memory leak: when Field.Index.TOKENIZED

2007-12-17 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1091:


Attachment: TestOOM.java

Attached TestOOM; it does not reproduce the problem on XP, JRE 1.5.

 Big IndexWriter memory leak: when Field.Index.TOKENIZED
 ---

 Key: LUCENE-1091
 URL: https://issues.apache.org/jira/browse/LUCENE-1091
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.2
 Environment: Ubuntu Linux 7.10, 32-bit
 Java 1.6.0 buld 1.6.0_03-b05 (default in Ubuntu 7.10)
 1GB RAM
Reporter: Mirza Hadzic
 Attachments: TestOOM.java


 This little program eats incrementally 2MB of virtual RAM per each 1000 
 documents indexed, only when Field.Index.TOKENIZED used :
 public Document getDoc() {
Document document = new Document();
document.add(new Field(foo, foo bar, Field.Store.NO, 
 Field.Index.TOKENIZED));
return document;
 }
 public Document run() {
IndexWriter writer = new IndexWriter(new File(jIndexFileName), new 
 StandardAnalyzer(), true);  
for (int i = 0; i  100; i++) {
   writer.addDocument(getDoc());
}
 }




[jira] Issue Comment Edited: (LUCENE-1091) Big IndexWriter memory leak: when Field.Index.TOKENIZED

2007-12-17 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552634
 ] 

doronc edited comment on LUCENE-1091 at 12/17/07 10:09 PM:


I was not able to recreate this.

Can you run the attached TestOOM and see how much memory is consumed and what 
used-memory stats get printed?


  was (Author: doronc):
I was not able to recreate this.

Can you run the attached TestOOM (it expects a single indexDir argument on your 
system, and see how much memory is consumed and what used-memory stats gets 
printed?

  
 Big IndexWriter memory leak: when Field.Index.TOKENIZED
 ---

 Key: LUCENE-1091
 URL: https://issues.apache.org/jira/browse/LUCENE-1091
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.2
 Environment: Ubuntu Linux 7.10, 32-bit
 Java 1.6.0 buld 1.6.0_03-b05 (default in Ubuntu 7.10)
 1GB RAM
Reporter: Mirza Hadzic
 Attachments: TestOOM.java


 This little program eats incrementally 2MB of virtual RAM per each 1000 
 documents indexed, only when Field.Index.TOKENIZED used :
 public Document getDoc() {
Document document = new Document();
document.add(new Field(foo, foo bar, Field.Store.NO, 
 Field.Index.TOKENIZED));
return document;
 }
 public Document run() {
IndexWriter writer = new IndexWriter(new File(jIndexFileName), new 
 StandardAnalyzer(), true);  
for (int i = 0; i  100; i++) {
   writer.addDocument(getDoc());
}
 }




[jira] Updated: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly

2007-12-17 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-588:
-

Lucene Fields: [Patch Available]

 Escaped wildcard character in wildcard term not handled correctly
 -

 Key: LUCENE-588
 URL: https://issues.apache.org/jira/browse/LUCENE-588
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.0.0
 Environment: Windows XP SP2
Reporter: Sunil Kamath
Assignee: Michael Busch
Priority: Minor
 Attachments: LUCENE-588.patch


 If an escaped wildcard character is specified in a wildcard query, it is 
 treated as a wildcard instead of a literal.
 e.g., t\??t is converted by the QueryParser to t??t - the escape character is 
 discarded.




[jira] Assigned: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly

2007-12-17 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-588:


Assignee: Michael Busch

 Escaped wildcard character in wildcard term not handled correctly
 -

 Key: LUCENE-588
 URL: https://issues.apache.org/jira/browse/LUCENE-588
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.0.0
 Environment: Windows XP SP2
Reporter: Sunil Kamath
Assignee: Michael Busch
 Attachments: LUCENE-588.patch


 If an escaped wildcard character is specified in a wildcard query, it is 
 treated as a wildcard instead of a literal.
 e.g., t\??t is converted by the QueryParser to t??t - the escape character is 
 discarded.




Re: TeeTokenFilter performance testing

2007-12-17 Thread Karl Wettin


On 17 Dec 2007, at 05:40, Grant Ingersoll wrote:

a somewhat common case whereby two or more fields share a fair  
number of common analysis steps.


Right.

For the smaller token counts, any performance difference is  
negligible.  However, even at 500 tokens, one starts to see a  
difference.  The first thing to note is that TeeTokenFilter (TTF) is  
much _slower_ in the case that all tokens are siphoned off (X = 1).   
I believe the reason is the cost of Token.clone()


I might be missing something here, but why do you clone? I usually  
fill a List<Token> with the same instances and then only clone the  
tokens I need to update. The same Token instances are used in multiple  
fields and queries at the same time, and I have never had any problems  
with that. Should I be expecting some?
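
A rough sketch of the reuse Karl describes (hypothetical helper, assuming the
current TokenStream.next() API that returns Token objects): the same Token
instances back several fields, and only a token that actually needs different
attributes gets cloned before modification.

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/** Sketch: replays a pre-built List of Tokens without cloning them. */
public class ListTokenStream extends TokenStream {
  private final Iterator tokens;

  public ListTokenStream(List sharedTokens) {
    this.tokens = sharedTokens.iterator();
  }

  public Token next() {
    // The same Token instances can back several fields; clone only when a
    // field needs to change positions, types, or payloads before indexing.
    return tokens.hasNext() ? (Token) tokens.next() : null;
  }
}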


--
karl
