[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-11-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546148
 ] 

Michael Busch commented on LUCENE-584:
--

{quote}
1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching 
consistently.
2. introduce MatchFilter as superclass of Filter and add a minimal 
DefaultMatcher to be used in IndexSearcher, i.e. add BitSetMatcher
{quote}

Paul, I like the iterative plan you suggested. I started reviewing the
Matcher-20071122-1ground.patch. I have some questions:
- Is the API fully backwards compatible?
- Did you run performance tests to check whether BitSetMatcher is 
slower than using a BitSet directly?
- With just the mentioned patch applied I get compile errors, 
because the DefaultMatcher is missing. Could you provide a patch that
also includes the BitSetMatcher and has Filter#getMatcher() return it?
Also, I believe the patch should modify Hits.java to use MatchFilter 
instead of Filter? And a unit test that exercises the BitSetMatcher 
would be nice!
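
For reference, a minimal sketch of what such a BitSetMatcher might look like
(assuming a Scorer-like next()/skipTo()/doc() contract; the names and
signatures in the actual patch may differ):

{code:java}
import java.util.BitSet;

// Hypothetical sketch, not the patch itself: iterates the set bits of a
// BitSet as matching document numbers.
public class BitSetMatcher extends Matcher {
  private final BitSet bits;
  private int doc = -1;

  public BitSetMatcher(BitSet bits) {
    this.bits = bits;
  }

  public boolean next() {
    doc = bits.nextSetBit(doc + 1); // next set bit at or after doc+1, -1 if none
    return doc != -1;
  }

  public boolean skipTo(int target) {
    doc = bits.nextSetBit(target);
    return doc != -1;
  }

  public int doc() {
    return doc;
  }
}
{code}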

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, 
> Matcher-20071122-1ground.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.
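>
> A minimal sketch of that default implementation, delegating to =java.util.BitSet= (the names are the ones proposed above, not an existing API):
> {code}
> public class DefaultBitSet implements AbstractBitSet 
> {
>   private final java.util.BitSet bits;
>
>   public DefaultBitSet(java.util.BitSet bits) 
>   {
>     this.bits = bits;
>   }
>
>   // delegate membership tests to the underlying java.util.BitSet
>   public boolean get(int index) 
>   {
>     return bits.get(index);
>   }
> }
> {code}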

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546072
 ] 

Michael Busch commented on LUCENE-1058:
---

I'm currently quite busy with other stuff. Feel free to go ahead ;)

> New Analyzer for buffering tokens
> -
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, 
> LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546069
 ] 

Grant Ingersoll commented on LUCENE-1058:
-

OK, looks good to me and is much simpler.  The only thing that gets complicated 
is the constructors, but that should be manageable.  Thanks for bearing w/ me :-)

Does one of you want to whip up a patch w/ tests, or do you want me to do it?




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546062
 ] 

Michael Busch commented on LUCENE-1058:
---

I like the TeeTokenFilter! +1




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546058
 ] 

Yonik Seeley commented on LUCENE-1058:
--

I think having the "tee" solves the many-to-many case... you can have many 
fields contribute tokens to a new field.

{code}
ListTokenizer sink1 = new ListTokenizer(null);
ListTokenizer sink2 = new ListTokenizer(null);

TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

// now sink1 and sink2 will both get tokens from both reader1 and reader2 after
// the whitespace tokenizer
// now we can further wrap any of these in extra analysis, and more "tees" can
// be inserted if desired

TokenStream final1 = new LowerCaseFilter(source1);
TokenStream final2 = source2;
TokenStream final3 = new EntityDetect(sink1);
TokenStream final4 = new URLDetect(sink2);

d.add(new Field("f1", final1));
d.add(new Field("f2", final2));
d.add(new Field("f3", final3));
d.add(new Field("f4", final4));
{code}




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546052
 ] 

Yonik Seeley commented on LUCENE-1058:
--

Very similar to what I came up with I think... (all untested, etc)

{code}
class ListTokenizer extends Tokenizer {
  protected List lst = new ArrayList();
  protected Iterator iter;

  public ListTokenizer(List input) {
    this.lst = input;
    if (this.lst == null) this.lst = new ArrayList();
  }

  /** only valid if tokens have not been consumed,
   * i.e. if this tokenizer is not part of another tokenstream
   */
  public List getTokens() {
    return lst;
  }

  public Token next(Token result) throws IOException {
    if (iter == null) iter = lst.iterator();
    if (!iter.hasNext()) return null; // end of the buffered tokens
    return (Token) iter.next();
  }

  /** Override this method to cache only certain tokens, or new tokens based
   * on the old tokens.
   */
  public void add(Token t) {
    if (t == null) return;
    lst.add((Token) t.clone()); // clone, since the producer may reuse the Token
  }

  public void reset() throws IOException {
    iter = lst.iterator();
  }
}

class TeeTokenFilter extends TokenFilter {
  ListTokenizer sink;

  protected TeeTokenFilter(TokenStream input, ListTokenizer sink) {
    super(input);
    this.sink = sink;
  }

  public Token next(Token result) throws IOException {
    Token t = input.next(result);
    sink.add(t); // siphon a copy of each token off to the sink
    return t;
  }
}
{code}




[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-27 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546051
 ] 

Doug Cutting commented on LUCENE-1044:
--

> How about if we don't sync every single commit point?

I'm confused.  The semantics of commit should be that all changes prior are 
made permanent, and no subsequent changes are permanent until the next commit.  
So syncs, if any, should map 1:1 to commits, no?  Folks can make indexing 
faster by committing/syncing less often.


> Behavior on hard power shutdown
> ---
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 
> 1.5
>Reporter: venkat rangan
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: FSyncPerfTest.java, LUCENE-1044.patch, 
> LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch
>
>
> When indexing a large number of documents, upon a hard power failure (e.g. 
> pulling the power cord), the index seems to get corrupted. We start a Java 
> application as a Windows Service and feed it documents. In some cases 
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the 
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes 
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes 
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct. 
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our 
> customer deployments to 1.9 or later version, but would be happy to back-port 
> a patch, if the patch is small enough and if this problem is already solved.




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546050
 ] 

Michael Busch commented on LUCENE-1058:
---

We need to change the CachingTokenFilter a bit (untested code):

{code:java}
public class CachingTokenFilter extends TokenFilter {
  // protected so that subclasses (like the ProperNounTF below) can add to it
  protected List cache;
  private Iterator iterator;
  
  public CachingTokenFilter(TokenStream input) {
    super(input);
    this.cache = new LinkedList();
  }
  
  public Token next() throws IOException {
    if (iterator != null) {
      if (!iterator.hasNext()) {
        // the cache is exhausted, return null
        return null;
      }
      return (Token) iterator.next();
    } else {
      Token token = input.next();
      addTokenToCache(token);
      return token;
    }
  }
  
  public void reset() throws IOException {
    if (cache != null) {
      iterator = cache.iterator();
    }
  }
  
  protected void addTokenToCache(Token token) {
    if (token != null) {
      cache.add(token);
    }
  }
}
{code}

Then you can implement the ProperNounTF:

{code:java}
class ProperNounTF extends CachingTokenFilter {
  public ProperNounTF(TokenStream input) {
    super(input);
  }

  protected void addTokenToCache(Token token) {
    // cache only the tokens recognized as proper nouns
    if (token != null && isProperNoun(token)) {
      cache.add(token);
    }
  }
  
  private boolean isProperNoun(Token token) {...}
}  
{code}

And then you add everything to Document:

{code:java}
Document d = new Document();
TokenStream properNounTf = new ProperNounTF(new StandardTokenizer(reader));
TokenStream stdTf = new CachingTokenFilter(new StopTokenFilter(properNounTf));
TokenStream lowerCaseTf = new LowerCaseTokenFilter(stdTf);


d.add(new Field("std", stdTf));
d.add(new Field("nouns", properNounTf));
d.add(new Field("lowerCase", lowerCaseTf));
{code}

Again, this is untested, but I believe it should work.




[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546042
 ] 

Michael McCandless commented on LUCENE-1044:



How about if we don't sync every single commit point?

I think on a crash what's important when you come back up is 1) index
is consistent and 2) you have not lost that many docs from your index.
Losing the last N (up to mergeFactor) flushes might be acceptable?

EG we could force a full sync only when we commit the merge, before we
remove the merged segments.  This would mean on a crash that you're
"guaranteed" to have the last successfully committed & sync'd merge to
fall back to, and possibly a newer commit point if the OS had sync'd
those files on its own?

That would be a big simplification because I think we could just do
the sync() in the foreground since ConcurrentMergeScheduler is already
using BG threads to do merges.

This would also mean we cannot delete the commit points that were not
sync'd.  So the first 10 flushes would result in 10 segments_N files.
But then when the merge of these segments completes, and the result is
sync'd, those files could all be deleted.

Plus we would have to fix retry logic on loading the segments file to
try more than just the 2 most recent commit points but that's a pretty
minor change.

I think it should mean better performance, because the longer you wait
to call sync() presumably the more likely it is a no-op if the OS has
already sync'd the file.





[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546040
 ] 

Grant Ingersoll commented on LUCENE-1058:
-

OK, I am trying not to be fixated on the Analyzer.  I guess I haven't fully 
synthesized the new TokenStream use in DocsWriter.

I agree, I don't like the no-value Field, and am open to suggestions.

So, I guess I am going to push back and ask: how would you solve the case 
where you have the following fields and analysis:

source field:
StandardTokenizer
Proper Noun TF
LowerCaseTF
StopTF

buffered1 field:
Proper Noun Cache TF (cache of all terms found to be proper nouns by the 
Proper Noun TF)

buffered2 field:
All terms lower cased

And the requirement is that you only do the analysis phase once (i.e. for the 
source field) and the other two fields are built from memory.

I am just not seeing it yet, so I appreciate the explanation, as it will better 
cement my understanding of the new TokenStream stuff and DocsWriter.





[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546039
 ] 

Michael McCandless commented on LUCENE-1044:


Woops, the last line in the table above is wrong (it's a copy of the line 
before it).  I'll re-run the test.




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546031
 ] 

Michael Busch commented on LUCENE-1058:
---

I think the ideas here make sense, e.g. to have a buffering
TokenFilter that doesn't buffer all tokens but enables the 
user to control which tokens to buffer. 

What is still not clear to me is why we have to introduce a
new API for this and a new kind of analyzer. Allowing the
creation of a no-value field seems strange. Can't we achieve
all this by using the Field(String, TokenStream) API without
the analyzer indirection?

The javadocs should make clear that the IndexWriter processes
fields in the same order the user added them. So if a user 
adds TokenStream ts1 and thereafter ts2, they can be sure 
that ts1 is processed first. With that knowledge, ts1 can
buffer certain tokens that ts2 then uses. Adding even more
fields that use the same tokens is straightforward; a sketch
follows below.
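
A minimal sketch of that ordering contract, reusing the ListTokenizer and
TeeTokenFilter sketched earlier in this thread (untested; reader stands in
for whatever the application supplies):

{code:java}
Document d = new Document();
ListTokenizer sink = new ListTokenizer(null); // buffer for the teed tokens

// ts1 tees every token it produces into the sink while f1 is inverted
TokenStream ts1 = new TeeTokenFilter(new WhitespaceTokenizer(reader), sink);
// ts2 simply replays whatever ts1 buffered
TokenStream ts2 = sink;

d.add(new Field("f1", ts1)); // added first, so processed first: fills the sink
d.add(new Field("f2", ts2)); // added second: consumes the buffered tokens
{code}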




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546028
 ] 

Yonik Seeley commented on LUCENE-1058:
--

I dunno... it feels like we should have the right generic solution 
(many-to-many) before committing anything in this case, simply because this is 
all user-level code (the absence of this patch doesn't prevent the user from 
doing any of this... no package-protected access rights are needed, etc.).





[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546022
 ] 

Grant Ingersoll commented on LUCENE-1058:
-

Any objection to me committing the CachedAnalyzer and CachedTokenizer pieces of 
this patch? I don't think they are affected by the other parts of this, and 
they solve the pre-analysis portion of this discussion.  In the meantime, I 
will think some more about the generic field case, as I do think it is useful.  
I am also trying out some basic benchmarking on this.

{quote}
Things like entity extraction are normally not done by lucene analyzers AFAIK
{quote}

Consider yourself in the "know" now, as I have done this on a few occasions. 
But yes, I do agree a one-to-many approach is probably better if it can be 
done in a generic way.




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546004
 ] 

Yonik Seeley commented on LUCENE-1058:
--

{quote}To some extent, I was thinking that this could help optimize Solr's 
copyField mechanism.{quote}
Maybe... it would take quite a bit of work to automate it, though, I think.

As far as pre-analysis costs go, iteration is pretty much free in comparison to 
everything else.  Memory is the big factor.

Things like entity extraction are normally not done by lucene analyzers 
AFAIK... but if one wanted a framework to do that, the problem is more generic: 
you really want to be able to add to multiple fields from multiple other 
fields.





[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546002
 ] 

Grant Ingersoll commented on LUCENE-1058:
-

{quote}
What if they wanted 3 fields instead of two?
{quote}
True.  I'll have to think about a more generic approach.  In some sense, I 
think 2 is often sufficient, but you are right, it isn't totally generic in the 
spirit of Lucene.  

To some extent, I was thinking that this could help optimize Solr's copyField 
mechanism.  In Solr's case, I think you often have copy fields that have 
marginal differences in the filters that are applied.  It would be useful for 
Solr to be able to optimize these so that it doesn't have to go through the 
whole analysis chain again.

{quote}
Isn't this what your current code does?
{quote}
No. In my main use case (# of buffered tokens is << # of source tokens) the 
only tokens kept around are the (much) smaller subset of buffered tokens.  In 
the pre-analysis approach you have to keep both the source field tokens and 
the buffered tokens, and you increase the work by having to iterate over the 
cached tokens in the list inside Lucene.  Thus, you have the cost of the 
analysis in your application, plus the storage of both token lists (one large, 
one small, likely), plus, in Lucene, the cost of iterating over two lists.  In 
my approach, I think, you have the cost of analysis plus the cost of storing 
one (small) list of tokens and the cost of iterating that list.




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545999
 ] 

Yonik Seeley commented on LUCENE-1058:
--

{quote}As for the convoluted cross-field logic, I don't think it is all that 
convoluted.{quote}

But it's baked into CollaboratingAnalyzer... it seems like this is better left 
to the user.  What if they wanted 3 fields instead of two?

{quote}
I do agree somewhat about the pre-analysis approach, except for the case where 
there may be a large number of tokens in the source field, in which case, you 
are holding them around in memory
{quote}
Isn't this what your current code does?





[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545995
 ] 

Grant Ingersoll commented on LUCENE-1058:
-

{quote}
Maybe I'm missing something?
{quote}

No, I don't think you are missing anything in that use case; it's just an 
example of its use.  And I am not totally sold on this approach, but mostly am 
:-) 

I had originally considered your option, but didn't feel it was satisfactory 
for the case where you are extracting things like proper nouns, or maybe 
generating a category value.  The more general case is where not all the tokens 
are needed (in fact, very few are).  In those cases, you have to go back 
through the whole list of cached tokens in order to extract the ones you want.  
In fact, thinking some more on it, I am not sure my patch goes far enough, in 
the sense of: what if you want it to buffer in mid-stream?  

For example, if you had:
StandardTokenizer
Proper Noun TF
LowerCaseTF
StopTF

and Proper Noun TF is solely responsible for setting aside proper nouns as it 
comes across them in the stream.

As for the convoluted cross-field logic, I don't think it is all that 
convoluted.  There are only two fields, and the implementing Analyzer takes 
care of all of it.  The only real requirement the application has is that the 
fields be ordered correctly.  

I do agree somewhat about the pre-analysis approach, except for the case where 
there may be a large number of tokens in the source field, in which case you 
are holding them all in memory (maxFieldLength mitigates this to some extent).  
Also, it puts the onus on the app writer to do it, when it could be pretty 
straightforward for Lucene to do it w/o its usual analysis pipeline.

At any rate, separate from the CollaboratingAnalyzer, I do think the 
CachedTokenFilter is useful, especially in supporting the pre-analysis 
approach.






[jira] Updated: (LUCENE-1044) Behavior on hard power shutdown

2007-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1044:
---

Attachment: LUCENE-1044.take4.patch

OK I did a simplistic patch (attached) whereby FSDirectory has a
background thread that re-opens, syncs, and closes those files that
Lucene has written.  (I'm using a modified version of the class from
Doron's test).

This patch is nowhere near ready to commit; I just coded up enough so
we could get a rough measure of performance cost of syncing.  EG we
must prevent deletion of a commit point until a future commit point is
fully sync'd to stable storage; we must also take care not to sync a
file that has been deleted before we sync'd it; don't sync until the
end when running with autoCommit=false; merges if run by
ConcurrentMergeScheduler should [maybe] sync in the foreground; maybe
forcefully throttle back updates if syncing is falling too far behind;
etc.
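
For reference, a rough sketch (not the actual patch) of the re-open-and-sync
step such a background thread performs for each file Lucene has written:

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Re-open the file, force its bytes to stable storage, then close it.
static void syncFile(File file) throws IOException {
  RandomAccessFile raf = new RandomAccessFile(file, "rw");
  try {
    raf.getFD().sync(); // fsync: blocks until the OS flushes the file to disk
  } finally {
    raf.close();
  }
}
{code}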

I ran the same alg as the tests above (index the first 150K docs of
Wikipedia).  I ran CFS and non-CFS, each with sync and nosync (4 tests), for
each IO system.  Time is the fastest of 2 runs:

|| IO System || CFS sync || CFS nosync || CFS % slower || non-CFS sync || non-CFS nosync || non-CFS % slower ||
| ReiserFS 6-drive RAID5 array Linux (2.6.22.1) | 188 | 157 | 19.7% | 143 | 147 | -2.7% |
| EXT3 single internal drive Linux (2.6.22.1) | 173 | 157 | 10.2% | 136 | 132 | 3.0% |
| 4 drive RAID0 array Mac Pro (10.4 Tiger) | 153 | 152 | 0.7% | 150 | 149 | 0.7% |
| Win XP Pro laptop, single drive | 463 | 352 | 31.5% | 343 | 335 | 2.4% |
| Mac Pro single external drive | 463 | 352 | 31.5% | 343 | 335 | 2.4% |

The good news is, the non-CFS case shows very little cost when we do
BG sync'ing!

The bad news is, the CFS case still shows a high cost.  However, by
not sync'ing the files that go into the CFS (and also not committing a
new segments_N file until after the CFS is written) I expect that cost
to go way down.

One caveat: I'm using a 8 MB RAM buffer for all of these tests.  As
Yonik pointed out, if you have a smaller buffer, or, you add just a
few docs and then close your writer, the sync cost as a pctg of net
indexing time will be quite a bit higher.





[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545966
 ] 

Yonik Seeley commented on LUCENE-1058:
--

Maybe I'm not looking at it the right way yet, but I'm not sure this feels 
"right"...
Since Field has a tokenStreamValue(), wouldn't it be easiest to just use that?
If the tokens of two fields are related, one could just pre-analyze those 
fields and set the token streams appropriately.  Seems more flexible and keeps 
any convoluted cross-field logic in the application domain.
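
A hedged sketch of that pre-analysis approach, reusing the ListTokenizer
sketched earlier in this thread (untested; analyzer and text stand in for
whatever the application uses):

{code:java}
// Analyze once in the application, buffering the tokens...
List tokens = new ArrayList();
TokenStream ts = analyzer.tokenStream("f1", new StringReader(text));
for (Token t = ts.next(); t != null; t = ts.next()) {
  tokens.add((Token) t.clone()); // clone in case the stream reuses tokens
}

// ...then hand each field a stream over the buffered tokens.
Document d = new Document();
d.add(new Field("f1", new ListTokenizer(tokens)));
d.add(new Field("f2", new LowerCaseFilter(new ListTokenizer(tokens))));
{code}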




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545959
 ] 

Michael Busch commented on LUCENE-1058:
---

Grant,

I'm not sure why we need this patch.

For the testcase that you're describing:
{quote}
For example, if you want to have two fields, one lowercased and one not, but 
all the other analysis is the same, then you could save off the tokens to be 
output for a different field.
{quote}

can't you simply do something like this:
{code:java}
Document d = new Document();
TokenStream t1 = new CachingTokenFilter(new WhitespaceTokenizer(reader));
TokenStream t2 = new LowerCaseFilter(t1);
d.add(new Field("f1", t1));
d.add(new Field("f2", t2));
{code}

Maybe I'm missing something?





[jira] Created: (LUCENE-1070) DateTools with DAY resolution doesn't work depending on your timezone

2007-11-27 Thread Mike Baroukh (JIRA)
DateTools with DAY resolution doesn't work depending on your timezone
---

 Key: LUCENE-1070
 URL: https://issues.apache.org/jira/browse/LUCENE-1070
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.2
Reporter: Mike Baroukh


Hi.

There is another issue, closed, that introduced a bug: 
https://issues.apache.org/jira/browse/LUCENE-491

Here is a simple test case:

DateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm");
Date d1 = df.parse("10/10/2008 10:00");
System.err.println(DateTools.dateToString(d1, Resolution.DAY));
Date d2 = df.parse("10/10/2008 00:00");
System.err.println(DateTools.dateToString(d2, Resolution.DAY));

This outputs:

20081010
20081009

So the days are the same, but with DAY resolution the indexed values don't 
refer to the same day.
This is because of DateTools.round(): using a Calendar initialized to GMT can 
make the given Date fall on the previous day, depending on my timezone.

The part I don't understand is why we take a Date as input, convert it to a 
Calendar, and then convert it again before printing.
This operation is supposed to "round" the date, but simply using a DateFormat 
to format the date and print only the wanted fields does the same work, 
doesn't it?

The problem is: I see absolutely no solution at the moment. We could have a 
workaround if dateToString() took a Date as input, but with a long the 
timezone is lost.
I also suppose that the correction made on the other issue 
(https://issues.apache.org/jira/browse/LUCENE-491) is worse than the bug, 
because it only corrects the case of those who use dates with a timezone 
different from the local timezone of the JVM.

So, my solution: add a DateTools.dateToString() that takes a Date as a 
parameter and deprecate the version that uses a long.












[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1058:


Attachment: LUCENE-1058.patch

Fixed a failing test.




[jira] Commented: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption

2007-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545921
 ] 

Michael McCandless commented on LUCENE-1069:


This is the thread that spawned this issue:

http://www.gossamer-threads.com/lists/lucene/java-user/55124

> CheckIndex incorrectly sees deletes as index corruption
> ---
>
> Key: LUCENE-1069
> URL: https://issues.apache.org/jira/browse/LUCENE-1069
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1069.patch
>
>
> There is a silly bug in CheckIndex whereby any segment with deletes is
> considered corrupt.
> Thanks to Bogdan Ghidireac for reporting this.




[jira] Resolved: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption

2007-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1069.


Resolution: Fixed

I just committed this.  Thanks for catching this & reporting it Bogdan.




[jira] Commented: (LUCENE-935) Improve maven artifacts

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545919
 ] 

Grant Ingersoll commented on LUCENE-935:


Done.

> Improve maven artifacts
> ---
>
> Key: LUCENE-935
> URL: https://issues.apache.org/jira/browse/LUCENE-935
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Attachments: lucene-935-new.patch, lucene-935-rename-poms.patch, 
> lucene-935.patch
>
>
> There are a couple of things we can improve for the next release:
> - "*pom.xml" files should be renamed to "*pom.xml.template"
> - artifacts "lucene-parent" should extend "apache-parent"
> - add source jars as artifacts
> - update  task to work with latest version of 
> maven-ant-tasks.jar
> - metadata filenames should not contain "local"




[jira] Resolved: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1067.


Resolution: Fixed

I just committed this.  Thanks Grant for catching this.

> TestStressIndexing has intermittent failures
> 
>
> Key: LUCENE-1067
> URL: https://issues.apache.org/jira/browse/LUCENE-1067
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1067.patch
>
>
> See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:
>  OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
> [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] java.lang.NullPointerException
> [junit] at
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
> [junit] at
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
> [junit] at org.apache.lucene.index.SegmentInfos
> $FindSegmentsFile.run(SegmentInfos.java:544)
> [junit] at
> org
> .apache
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
> [junit] at
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $SearcherThread.doWork(TestStressIndexing.java:111)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $TimedThread.run(TestStressIndexing.java:55)
> [junit] -  ---
> [junit] Testcase:
> testStressIndexAndSearching
> (org.apache.lucene.index.TestStressIndexing): FAILED
> [junit] hit unexpected exception in search1
> [junit] junit.framework.AssertionFailedError: hit unexpected
> exception in search1
> [junit] at
> org
> .apache
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
> 159)
> [junit] at
> org
> .apache
> .lucene
> .index
> .TestStressIndexing
> .testStressIndexAndSearching(TestStressIndexing.java:187)
> [junit]
> [junit]
> [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> Subsequent runs have, however passed. Has anyone else hit this on
> trunk?
> I am running using "ant clean test"
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
> -Grant 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545909
 ] 

Michael McCandless commented on LUCENE-1067:


Thanks for the review Yonik!  I'll commit shortly.

> TestStressIndexing has intermittent failures
> 
>
> Key: LUCENE-1067
> URL: https://issues.apache.org/jira/browse/LUCENE-1067
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1067.patch
>
>
> See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:
>  OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
> [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] java.lang.NullPointerException
> [junit] at
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
> [junit] at
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
> [junit] at org.apache.lucene.index.SegmentInfos
> $FindSegmentsFile.run(SegmentInfos.java:544)
> [junit] at
> org
> .apache
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
> [junit] at
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $SearcherThread.doWork(TestStressIndexing.java:111)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $TimedThread.run(TestStressIndexing.java:55)
> [junit] -  ---
> [junit] Testcase:
> testStressIndexAndSearching
> (org.apache.lucene.index.TestStressIndexing): FAILED
> [junit] hit unexpected exception in search1
> [junit] junit.framework.AssertionFailedError: hit unexpected
> exception in search1
> [junit] at
> org
> .apache
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
> 159)
> [junit] at
> org
> .apache
> .lucene
> .index
> .TestStressIndexing
> .testStressIndexAndSearching(TestStressIndexing.java:187)
> [junit]
> [junit]
> [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> Subsequent runs have, however passed. Has anyone else hit this on
> trunk?
> I am running using "ant clean test"
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
> -Grant 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545904
 ] 

Yonik Seeley commented on LUCENE-1067:
--

Looks good, +1

> TestStressIndexing has intermittent failures
> 
>
> Key: LUCENE-1067
> URL: https://issues.apache.org/jira/browse/LUCENE-1067
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1067.patch
>
>
> See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:
>  OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
> [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] java.lang.NullPointerException
> [junit] at
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
> [junit] at
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
> [junit] at org.apache.lucene.index.SegmentInfos
> $FindSegmentsFile.run(SegmentInfos.java:544)
> [junit] at
> org
> .apache
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
> [junit] at
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $SearcherThread.doWork(TestStressIndexing.java:111)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $TimedThread.run(TestStressIndexing.java:55)
> [junit] -  ---
> [junit] Testcase:
> testStressIndexAndSearching
> (org.apache.lucene.index.TestStressIndexing): FAILED
> [junit] hit unexpected exception in search1
> [junit] junit.framework.AssertionFailedError: hit unexpected
> exception in search1
> [junit] at
> org
> .apache
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
> 159)
> [junit] at
> org
> .apache
> .lucene
> .index
> .TestStressIndexing
> .testStressIndexAndSearching(TestStressIndexing.java:187)
> [junit]
> [junit]
> [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> Subsequent runs have, however passed. Has anyone else hit this on
> trunk?
> I am running using "ant clean test"
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
> -Grant 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Yonik Seeley
On Nov 27, 2007 11:20 AM, robert engels <[EMAIL PROTECTED]> wrote:
> Can you describe exactly how the lockless commits affect this? Or
> could a reader be accessing the same RAMFile as a writer?

No read/commit lock exists any more... so a writer could be in the
process of writing the segments.nnn file and the reader may try to
read it.

-Yonik
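
To make the race concrete, here is a toy sketch (hypothetical code, not
Lucene's RAMFile) of one thread growing an unsynchronized buffer list
while another reads it; the reader can fail much like the NPE in
RAMInputStream.readByte in the traces above:

{code}
import java.util.ArrayList;

// Toy sketch: a writer grows an unsynchronized list while a reader
// walks it.  The reader may see the new size before the element is
// visible (or index into a backing array that is being resized).
public class UnsyncedBuffersRace {
  static final ArrayList buffers = new ArrayList();
  static volatile boolean done = false;

  public static void main(String[] args) throws InterruptedException {
    Thread writer = new Thread() {
      public void run() {
        for (int i = 0; i < 100000; i++) {
          buffers.add(new byte[16]);   // structural change, no lock
        }
        done = true;
      }
    };
    Thread reader = new Thread() {
      public void run() {
        while (!done) {
          for (int i = 0; i < buffers.size(); i++) {
            if (buffers.get(i) == null) {   // half-published slot
              throw new NullPointerException("saw half-published buffer");
            }
          }
        }
      }
    };
    writer.start();
    reader.start();
    writer.join();
    reader.join();
  }
}
{code}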

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption

2007-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1069:
---

Attachment: LUCENE-1069.patch

Attached patch (with new unit test) fixes it.  I plan to commit
shortly...


> CheckIndex incorrectly sees deletes as index corruption
> ---
>
> Key: LUCENE-1069
> URL: https://issues.apache.org/jira/browse/LUCENE-1069
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1069.patch
>
>
> There is a silly bug in CheckIndex whereby any segment with deletes is
> considered corrupt.
> Thanks to Bogdan Ghidireac for reporting this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption

2007-11-27 Thread Michael McCandless (JIRA)
CheckIndex incorrectly sees deletes as index corruption
---

 Key: LUCENE-1069
 URL: https://issues.apache.org/jira/browse/LUCENE-1069
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.3
 Attachments: LUCENE-1069.patch

There is a silly bug in CheckIndex whereby any segment with deletes is
considered corrupt.

Thanks to Bogdan Ghidireac for reporting this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread robert engels
Can you describe exactly how the lockless commits affect this? Or  
could a reader be accessing the same RAMFile as a writer?


Seems that this really deviates from the simplicity of the write-once  
design of the original Lucene.


Do writers share the same underlying RAMDirectory? Seems this would  
cause a lot of contention.


Or point me to the relevant documentation.


On Nov 27, 2007, at 10:11 AM, Michael McCandless (JIRA) wrote:



[ https://issues.apache.org/jira/browse/LUCENE-1067? 
page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
tabpanel#action_12545885 ]


Michael McCandless commented on LUCENE-1067:


OK I think this is just a thread safety issue on RAMFile.

That class has these comments:

  // Only one writing stream with no concurrent reading streams, so  
no file synchronization required


  // Direct read-only access to state supported for streams since a  
writing stream implies no other concurrent streams


which were true before lockless commits but after lockless commits are
not true, specifically for the segments_N and segments.gen files.

I think the fix is to make "ArrayList buffers" private (it's package
private now), add methods to get a buffer & get the number of buffers,
and make sure all methods that access "buffers" are synchronized.



TestStressIndexing has intermittent failures


Key: LUCENE-1067
URL: https://issues.apache.org/jira/browse/LUCENE-1067

Project: Lucene - Java
 Issue Type: Bug
   Reporter: Grant Ingersoll
   Assignee: Michael McCandless
   Priority: Minor
Fix For: 2.3


See http://www.gossamer-threads.com/lists/lucene/java-dev/55092  
copied below:

 OK, I have seen this twice in the last two days:
Testsuite: org.apache.lucene.index.TestStressIndexing
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
sec
[junit]
[junit] - Standard Output ---
[junit] java.lang.NullPointerException
[junit] at
org.apache.lucene.store.RAMInputStream.readByte 
(RAMInputStream.java:67)

[junit] at
org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
[junit] at org.apache.lucene.index.SegmentInfos
$FindSegmentsFile.run(SegmentInfos.java:544)
[junit] at
org
.apache
.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
[junit] at
org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
[junit] at
org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
[junit] at
org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
[junit] at org.apache.lucene.index.TestStressIndexing
$SearcherThread.doWork(TestStressIndexing.java:111)
[junit] at org.apache.lucene.index.TestStressIndexing
$TimedThread.run(TestStressIndexing.java:55)
[junit] -  ---
[junit] Testcase:
testStressIndexAndSearching
(org.apache.lucene.index.TestStressIndexing): FAILED
[junit] hit unexpected exception in search1
[junit] junit.framework.AssertionFailedError: hit unexpected
exception in search1
[junit] at
org
.apache
.lucene.index.TestStressIndexing.runStressTest 
(TestStressIndexing.java:

159)
[junit] at
org
.apache
.lucene
.index
.TestStressIndexing
.testStressIndexAndSearching(TestStressIndexing.java:187)
[junit]
[junit]
[junit] Test org.apache.lucene.index.TestStressIndexing FAILED
Subsequent runs have, however passed. Has anyone else hit this on
trunk?
I am running using "ant clean test"
I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
how to reproduce at this point, but strikes me as a threading issue.
Oh joy!
I'll try to investigate more tomorrow to see if I can dream up a test
case.
-Grant


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1067:
---

Attachment: LUCENE-1067.patch

Attached patch.  All tests pass. I plan to commit in a day or two.

With this fix I can't get the test to fail after running 90+ times on
the MacPro quad.


> TestStressIndexing has intermittent failures
> 
>
> Key: LUCENE-1067
> URL: https://issues.apache.org/jira/browse/LUCENE-1067
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1067.patch
>
>
> See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:
>  OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
> [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] java.lang.NullPointerException
> [junit] at
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
> [junit] at
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
> [junit] at org.apache.lucene.index.SegmentInfos
> $FindSegmentsFile.run(SegmentInfos.java:544)
> [junit] at
> org
> .apache
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
> [junit] at
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $SearcherThread.doWork(TestStressIndexing.java:111)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $TimedThread.run(TestStressIndexing.java:55)
> [junit] -  ---
> [junit] Testcase:
> testStressIndexAndSearching
> (org.apache.lucene.index.TestStressIndexing): FAILED
> [junit] hit unexpected exception in search1
> [junit] junit.framework.AssertionFailedError: hit unexpected
> exception in search1
> [junit] at
> org
> .apache
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
> 159)
> [junit] at
> org
> .apache
> .lucene
> .index
> .TestStressIndexing
> .testStressIndexAndSearching(TestStressIndexing.java:187)
> [junit]
> [junit]
> [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> Subsequent runs have, however passed. Has anyone else hit this on
> trunk?
> I am running using "ant clean test"
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
> -Grant 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545885
 ] 

Michael McCandless commented on LUCENE-1067:


OK I think this is just a thread safety issue on RAMFile.

That class has these comments:

  // Only one writing stream with no concurrent reading streams, so no file 
synchronization required

  // Direct read-only access to state supported for streams since a writing 
stream implies no other concurrent streams

which were true before lockless commits but after lockless commits are
not true, specifically for the segments_N and segments.gen files.

I think the fix is to make "ArrayList buffers" private (it's package
private now), add methods to get a buffer & get the number of buffers,
and make sure all methods that access "buffers" are synchronized.
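
A minimal sketch of that shape (hypothetical code, not the committed
patch):

{code}
import java.util.ArrayList;

// All access to the buffer list is funneled through synchronized
// methods; "buffers" is private instead of package-private.
class RAMFileSketch {
  private final ArrayList buffers = new ArrayList();
  private long length;

  synchronized byte[] addBuffer(int size) {
    byte[] buffer = new byte[size];
    buffers.add(buffer);
    return buffer;
  }

  synchronized byte[] getBuffer(int index) {
    return (byte[]) buffers.get(index);
  }

  synchronized int numBuffers() {
    return buffers.size();
  }

  synchronized long getLength() {
    return length;
  }

  synchronized void setLength(long length) {
    this.length = length;
  }
}
{code}

Readers then call numBuffers()/getBuffer() instead of touching the list
directly, so they can no longer observe it mid-growth.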


> TestStressIndexing has intermittent failures
> 
>
> Key: LUCENE-1067
> URL: https://issues.apache.org/jira/browse/LUCENE-1067
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
>
> See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:
>  OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
> [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] java.lang.NullPointerException
> [junit] at
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
> [junit] at
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
> [junit] at org.apache.lucene.index.SegmentInfos
> $FindSegmentsFile.run(SegmentInfos.java:544)
> [junit] at
> org
> .apache
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
> [junit] at
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $SearcherThread.doWork(TestStressIndexing.java:111)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $TimedThread.run(TestStressIndexing.java:55)
> [junit] -  ---
> [junit] Testcase:
> testStressIndexAndSearching
> (org.apache.lucene.index.TestStressIndexing): FAILED
> [junit] hit unexpected exception in search1
> [junit] junit.framework.AssertionFailedError: hit unexpected
> exception in search1
> [junit] at
> org
> .apache
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
> 159)
> [junit] at
> org
> .apache
> .lucene
> .index
> .TestStressIndexing
> .testStressIndexAndSearching(TestStressIndexing.java:187)
> [junit]
> [junit]
> [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> Subsequent runs have, however passed. Has anyone else hit this on
> trunk?
> I am running using "ant clean test"
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
> -Grant 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1058:


Attachment: LUCENE-1058.patch

Added some more documentation, plus a test showing it is bad to use the
no-value Field constructor w/o support from the Analyzer to produce tokens.

> New Analyzer for buffering tokens
> -
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, 
> LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397
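
For a sense of the technique, here is a minimal sketch of such a
buffering filter against the 2.3-era TokenStream API (hypothetical
names, not the attached patch):

{code}
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Passes every token through unchanged while keeping a copy in a
// buffer that a second field's stream can replay later.
class BufferingTokenFilter extends TokenFilter {
  private final LinkedList buffer = new LinkedList();

  BufferingTokenFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null) {
      buffer.add(t);   // with the new reuse API this would need a clone
    }
    return t;
  }

  // Stream for the second field, replaying the buffered tokens.
  TokenStream bufferedStream() {
    return new TokenStream() {
      private final Iterator it = buffer.iterator();
      public Token next() {
        return it.hasNext() ? (Token) it.next() : null;
      }
    };
  }
}
{code}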

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Issue Comment Edited: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545846
 ] 

gsingers edited comment on LUCENE-1058 at 11/27/07 6:11 AM:
---

Added some more documentation, plus a test showing it is bad to use the
no-value Field constructor w/o support from the Analyzer to produce tokens.

If no objections, I will commit on Thursday or Friday of this week.

  was (Author: gsingers):
Added some more documentation, plus a test showing it is bad to use the
no-value Field constructor w/o support from the Analyzer to produce tokens.
  
> New Analyzer for buffering tokens
> -
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, 
> LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Potential bug in StandardTokenizerImpl

2007-11-27 Thread Eugenio Martinez

I am the one who raised the question about the Acronym - Host detection anomaly 
in the StandardAnalyzer class.

Thanks to Shai Erera for translating the discussion into the developers' list. I 
am surprised by Chris Hostetter's response, as this issue was treated by 
Erik Hatcher on November 22, 2005. I am exploring Hatcher's superb book now, 
Lucene in Action, trying to work around this issue, but I can't believe that 
this hasn't been fixed yet.

As I explained on the users' list, I've found that indexing fails to include 
certain emails and words that are present in the logfile when I launch an 
IndexWriter over a huge directory of logs. As I tried to isolate this bug, I 
hit the acronym-interpretation issue. Maybe there are more hidden anomalies in 
the StandardAnalyzer behavior under such a huge load.

At this moment I can say this behavior is deterministic: I can reproduce it 
over subsequent index and search calls, and it takes place with the same words 
and emails over and over. Could it be a side effect of document vectorization, 
since the logs are not natural language? As Lucene computes whether a token 
conveys relevant info (as the vector space model states), what if Lucene 
decided the token was not relevant? All of this supposing it works well, of 
course...

Any idea about this, or have you heard of it before?

Thanks and regards.

Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

TFN: 981 173 206    FAX: 981 173 223

VIDEOCONFERENCIA: 981 173 596 

[EMAIL PROTECTED]






   

Re: (LUCENE-1067) - Make TopDocs constructor public

2007-11-27 Thread Shai Erera
Oops, the issue number is 1064, not 1067.
Sorry for the confusion.

On Nov 27, 2007 2:10 PM, Shai Erera <[EMAIL PROTECTED]> wrote:

> Hey guys,
>
> No one has commented on this feature yet. The change is very simple. I
> don't mind doing it myself, if you explain the process to me ... do I just
> commit the change and then one of the committers needs to approve it, or is
> my part in this issue just the patch I sent?
>
> Cheers,
>
> Shai Erera
>



-- 
Regards,

Shai Erera


(LUCENE-1067) - Make TopDocs constructor public

2007-11-27 Thread Shai Erera
Hey guys,

No one has commented on this feature yet. The change is very simple. I don't
mind doing it myself, if you explain the process to me ... do I just commit the
change and then one of the committers needs to approve it, or is my part in
this issue just the patch I sent?

Cheers,

Shai Erera


Re: Potential bug in StandardTokenizerImpl

2007-11-27 Thread Shai Erera
Ok

I opened https://issues.apache.org/jira/browse/LUCENE-1068 and attached the
patch files.
I don't know if and how you can deprecate a JFlex grammar though.

On Nov 27, 2007 1:43 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Yes, please open a JIRA issue and submit your patches.
>
> I wonder if there is any way to deprecate functionality in a JFlex
> grammar?  That is, is there any way we can communicate to people that
> both will be supported through 2.9 and then the correct way will be
> supported in 3.x?
>
> -Grant
>
> On Nov 27, 2007, at 2:18 AM, Shai Erera wrote:
>
> > I understand it would change the behavior of existing search
> > solutions,
> > however the current behavior is just wrong. An ACRONYM cannot be
> > ABC.DEF. If
> > you look up acronym in Wikipedia, you find only examples of I.B.M. /
> > U.S.A.
> > like, or NATO, IBM, USA, but nothing of the form StandardAnalyzer
> > currently
> > recognizes.
> >
> > There are several ways to handle this change:
> > 1. Create a new analyzer that fixes the problem - that way,
> > applications
> > that don't want to use it will not have to, if they feel ok with the
> > current
> > behavior. However, for those who would like to get a correct behavior,
> > they'll be able to. This is not my favorite solution, but I think it
> > would
> > be preferable than simply not fixing it.
> > 2. Fix it in the new version (2.3) and specifically mention that in
> > the
> > release notes. Aren't there releases where applications need to re-
> > build the
> > index because of fundamental changes?
> >
> > Am I the only one who thinks that?
> >
> > BTW, I changed the definition in the jflex file and recompiled using
> > jflex
> > and it indeed solved the problem. It now recognizes www.abc.com. and
> > www.abc.com as hosts. I can attach the 'patch' files if you'd like to
> > compare.
> >
> > On Nov 27, 2007 9:07 AM, Chris Hostetter <[EMAIL PROTECTED]>
> > wrote:
> >
> >>
> >> : If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
> >> : (which is correct in my opinion).
> >> : However, if you pass "www.abc.com." (notice the extra '.' at the end), the
> >> : output is (wwwabccom,0,12,type=<ACRONYM>).
> >>
> >> see also...
> >>
> >>
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> >>
> >>
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
> >>
> >> one hitch with potentially changing this now is that it would break
> >> some searches in applications that have existing indexes built using
> >> previous versions.
> >>
> >>
> >>
> >> -Hoss
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
>
> --
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Regards,

Shai Erera


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-11-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1068:
---

Attachment: standardTokenizerImpl.patch

This is the result of recompiling the fixed JFlex file. Not sure how useful
this patch is, but I'm attaching it anyway.

> Invalid behavior of StandardTokenizerImpl
> -
>
> Key: LUCENE-1068
> URL: https://issues.apache.org/jira/browse/LUCENE-1068
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Shai Erera
> Attachments: standardTokenizerImpl.jflex.patch, 
> standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
> Analyzer analyzer = new StandardAnalyzer();
> TokenStream ts = analyzer.tokenStream("content", new 
> StringReader(""));
> Token t;
> while ((t = ts.next()) != null) {
> System.out.println(t);
> }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) 
> (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the 
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument on that).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which 
> is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form 
> A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from 
> this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM=  {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment relates to acronym as U.S.A., I.B.M. and not something 
> else. I changed the definition to
> ACRONYM=  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-11-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1068:
---

Attachment: standardTokenizerImpl.jflex.patch

This fixes the JFlex definition file. The change simply replaces:
ACRONYM=  {ALPHA} "." ({ALPHA} ".")+
with
ACRONYM=  {LETTER} "." ({LETTER} ".")+
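
The reason the one-token change works: {ALPHA} is a run of one or more
letters in that grammar, so the old rule also matched multi-letter
segments like abc.def., which is what swallowed trailing-dot hosts,
while {LETTER} limits each dotted segment to a single character. A
quick way to eyeball the fix (a sketch against the 2.3-era API,
assuming the regenerated StandardTokenizerImpl is on the classpath):

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AcronymFixCheck {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new StandardAnalyzer()
        .tokenStream("content", new StringReader("www.abc.com."));
    Token t;
    while ((t = ts.next()) != null) {
      // before the fix: a single <ACRONYM> token "wwwabccom"
      // after the fix:  a <HOST> token for www.abc.com
      System.out.println(t.termText() + " type=" + t.type());
    }
  }
}
{code}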

> Invalid behavior of StandardTokenizerImpl
> -
>
> Key: LUCENE-1068
> URL: https://issues.apache.org/jira/browse/LUCENE-1068
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Shai Erera
> Attachments: standardTokenizerImpl.jflex.patch, 
> standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
> Analyzer analyzer = new StandardAnalyzer();
> TokenStream ts = analyzer.tokenStream("content", new 
> StringReader(""));
> Token t;
> while ((t = ts.next()) != null) {
> System.out.println(t);
> }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) 
> (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the 
> output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument on that).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which 
> is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form 
> A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from 
> this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM=  {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment relates to acronym as U.S.A., I.B.M. and not something 
> else. I changed the definition to
> ACRONYM=  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1067:
--

Assignee: Michael McCandless

> TestStressIndexing has intermittent failures
> 
>
> Key: LUCENE-1067
> URL: https://issues.apache.org/jira/browse/LUCENE-1067
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
>
> See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:
>  OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
> [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] java.lang.NullPointerException
> [junit] at
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
> [junit] at
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
> [junit] at org.apache.lucene.index.SegmentInfos
> $FindSegmentsFile.run(SegmentInfos.java:544)
> [junit] at
> org
> .apache
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> [junit] at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
> [junit] at
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $SearcherThread.doWork(TestStressIndexing.java:111)
> [junit] at org.apache.lucene.index.TestStressIndexing
> $TimedThread.run(TestStressIndexing.java:55)
> [junit] -  ---
> [junit] Testcase:
> testStressIndexAndSearching
> (org.apache.lucene.index.TestStressIndexing): FAILED
> [junit] hit unexpected exception in search1
> [junit] junit.framework.AssertionFailedError: hit unexpected
> exception in search1
> [junit] at
> org
> .apache
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
> 159)
> [junit] at
> org
> .apache
> .lucene
> .index
> .TestStressIndexing
> .testStressIndexAndSearching(TestStressIndexing.java:187)
> [junit]
> [junit]
> [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> Subsequent runs have, however passed. Has anyone else hit this on
> trunk?
> I am running using "ant clean test"
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
> -Grant 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

2007-11-27 Thread Shai Erera (JIRA)
Invalid behavior of StandardTokenizerImpl
-

 Key: LUCENE-1068
 URL: https://issues.apache.org/jira/browse/LUCENE-1068
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Shai Erera


The following code prints the output of StandardAnalyzer:

Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader(""));
Token t;
while ((t = ts.next()) != null) {
  System.out.println(t);
}

If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which 
is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the 
output is (wwwabccom,0,12,type=<ACRONYM>).

I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argument on that).
2. It kind of prevents you from putting URLs at the end of a sentence, which is 
perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. 
and not ABC.DEF.

I looked at StandardTokenizerImpl.jflex and I think the problem comes from this 
definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM=  {ALPHA} "." ({ALPHA} ".")+

Notice how the comment relates to acronym as U.S.A., I.B.M. and not something 
else. I changed the definition to
ACRONYM=  {LETTER} "." ({LETTER} ".")+
and it solved the problem.

This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Potential bug in StandardTokenizerImpl

2007-11-27 Thread Grant Ingersoll

Yes, please open a JIRA issue and submit your patches.

I wonder if there is any way to deprecate functionality in a JFlex
grammar?  That is, is there any way we can communicate to people that
both will be supported through 2.9 and then the correct way will be  
supported in 3.x?


-Grant

On Nov 27, 2007, at 2:18 AM, Shai Erera wrote:

I understand it would change the behavior of existing search solutions;
however, the current behavior is just wrong. An ACRONYM cannot be
ABC.DEF. If you look up acronym in Wikipedia, you find only examples
like I.B.M. / U.S.A., or NATO, IBM, USA, but nothing of the form
StandardAnalyzer currently recognizes.

There are several ways to handle this change:
1. Create a new analyzer that fixes the problem - that way, applications
that don't want to use it will not have to, if they feel OK with the
current behavior. However, those who would like correct behavior will
be able to get it. This is not my favorite solution, but I think it
would be preferable to simply not fixing it.
2. Fix it in the new version (2.3) and specifically mention that in the
release notes. Aren't there releases where applications need to rebuild
the index because of fundamental changes?

Am I the only one who thinks that?

BTW, I changed the definition in the jflex file and recompiled using
jflex, and it indeed solved the problem. It now recognizes www.abc.com.
and www.abc.com as hosts. I can attach the 'patch' files if you'd like
to compare.

On Nov 27, 2007 9:07 AM, Chris Hostetter <[EMAIL PROTECTED]>  
wrote:




: If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
: (which is correct in my opinion).
: However, if you pass "www.abc.com." (notice the extra '.' at the end), the
: output is (wwwabccom,0,12,type=<ACRONYM>).

see also...

http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383

http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

one hitch with potentially changing this now is that it would break
some searches in applications that have existing indexes built using
previous versions.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
Regards,

Shai Erera


--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1067) TestStressIndexing has intermittent failures

2007-11-27 Thread Grant Ingersoll (JIRA)
TestStressIndexing has intermittent failures


 Key: LUCENE-1067
 URL: https://issues.apache.org/jira/browse/LUCENE-1067
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 2.3


See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below:

 OK, I have seen this twice in the last two days:
Testsuite: org.apache.lucene.index.TestStressIndexing
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
sec
[junit]
[junit] - Standard Output ---
[junit] java.lang.NullPointerException
[junit] at
org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
[junit] at
org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
[junit] at org.apache.lucene.index.SegmentInfos
$FindSegmentsFile.run(SegmentInfos.java:544)
[junit] at
org
.apache
.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
[junit] at
org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
[junit] at
org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
[junit] at
org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
[junit] at org.apache.lucene.index.TestStressIndexing
$SearcherThread.doWork(TestStressIndexing.java:111)
[junit] at org.apache.lucene.index.TestStressIndexing
$TimedThread.run(TestStressIndexing.java:55)
[junit] -  ---
[junit] Testcase:
testStressIndexAndSearching
(org.apache.lucene.index.TestStressIndexing): FAILED
[junit] hit unexpected exception in search1
[junit] junit.framework.AssertionFailedError: hit unexpected
exception in search1
[junit] at
org
.apache
.lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:
159)
[junit] at
org
.apache
.lucene
.index
.TestStressIndexing
.testStressIndexAndSearching(TestStressIndexing.java:187)
[junit]
[junit]
[junit] Test org.apache.lucene.index.TestStressIndexing FAILED

Subsequent runs have, however passed. Has anyone else hit this on
trunk?

I am running using "ant clean test"

I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
how to reproduce at this point, but strikes me as a threading issue.
Oh joy!

I'll try to investigate more tomorrow to see if I can dream up a test
case.

-Grant 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Occasional failure in TestStressIndexing.java

2007-11-27 Thread Grant Ingersoll
I opened https://issues.apache.org/jira/browse/LUCENE-1067 to track  
the issue.


On Nov 27, 2007, at 6:10 AM, Michael McCandless wrote:



OK I just ran the test 5 times, also on quad Mac Pro, and got the  
error to occur as well!


Ugh.

I will track it down.

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:

OK, I have seen this twice in the last two days:
Testsuite: org.apache.lucene.index.TestStressIndexing
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58
sec
[junit]
[junit] - Standard Output ---
[junit] java.lang.NullPointerException
[junit] at
org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java: 
67)

[junit] at
org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
[junit] at org.apache.lucene.index.SegmentInfos
$FindSegmentsFile.run(SegmentInfos.java:544)
[junit] at
org
.apache
.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
[junit] at
org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
[junit] at
org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
[junit] at
org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
[junit] at org.apache.lucene.index.TestStressIndexing
$SearcherThread.doWork(TestStressIndexing.java:111)
[junit] at org.apache.lucene.index.TestStressIndexing
$TimedThread.run(TestStressIndexing.java:55)
[junit] -  ---
[junit] Testcase:
testStressIndexAndSearching
(org.apache.lucene.index.TestStressIndexing):  FAILED
[junit] hit unexpected exception in search1
[junit] junit.framework.AssertionFailedError: hit unexpected
exception in search1
[junit] at
org
.apache
.lucene 
.index.TestStressIndexing.runStressTest(TestStressIndexing.java:

159)
[junit] at
org
.apache
.lucene
.index
.TestStressIndexing
.testStressIndexAndSearching(TestStressIndexing.java:187)
[junit]
[junit]
[junit] Test org.apache.lucene.index.TestStressIndexing FAILED

Subsequent runs have, however passed.  Has anyone else hit this on
trunk?

I am running using "ant clean test"

I'm on a Mac Pro 4 core, 4GB machine, if that helps at all.  Not sure
how to reproduce at this point, but strikes me as a threading issue.
Oh joy!

I'll try to investigate more tomorrow to see if I can dream up a test
case.

-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Occasional failure in TestStressIndexing.java

2007-11-27 Thread Michael McCandless

OK I just ran the test 5 times, also on quad Mac Pro, and got the error to 
occur as well!

Ugh.

I will track it down.

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> OK, I have seen this twice in the last two days:
> Testsuite: org.apache.lucene.index.TestStressIndexing
>  [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58  
> sec
>  [junit]
>  [junit] - Standard Output ---
>  [junit] java.lang.NullPointerException
>  [junit] at  
> org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
>  [junit] at  
> org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
>  [junit] at org.apache.lucene.index.SegmentInfos 
> $FindSegmentsFile.run(SegmentInfos.java:544)
>  [junit] at  
> org 
> .apache 
> .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
>  [junit] at  
> org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
>  [junit] at  
> org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
>  [junit] at  
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
>  [junit] at org.apache.lucene.index.TestStressIndexing 
> $SearcherThread.doWork(TestStressIndexing.java:111)
>  [junit] at org.apache.lucene.index.TestStressIndexing 
> $TimedThread.run(TestStressIndexing.java:55)
>  [junit] -  ---
>  [junit] Testcase:  
> testStressIndexAndSearching 
> (org.apache.lucene.index.TestStressIndexing):  FAILED
>  [junit] hit unexpected exception in search1
>  [junit] junit.framework.AssertionFailedError: hit unexpected  
> exception in search1
>  [junit] at  
> org 
> .apache 
> .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java: 
> 159)
>  [junit] at  
> org 
> .apache 
> .lucene 
> .index 
> .TestStressIndexing 
> .testStressIndexAndSearching(TestStressIndexing.java:187)
>  [junit]
>  [junit]
>  [junit] Test org.apache.lucene.index.TestStressIndexing FAILED
> 
> Subsequent runs have, however passed.  Has anyone else hit this on  
> trunk?
> 
> I am running using "ant clean test"
> 
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all.  Not sure  
> how to reproduce at this point, but strikes me as a threading issue.   
> Oh joy!
> 
> I'll try to investigate more tomorrow to see if I can dream up a test  
> case.
> 
> -Grant
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-27 Thread Doron Cohen
"Doug Cutting (JIRA)" <[EMAIL PROTECTED]> wrote on 26/11/2007 20:14:43:

> > I found out however that delaying the syncs (but intending to sync) also
> means keeping the file handles open [...]
>
> Not necessarily.  You could just queue the file names for sync,
> close them, and then have the background thread open, sync and
> close them.  The close could trigger the OS to sync things
> faster in the background.  Then the open/sync/close could
> mostly be a no-op.  Might be worth a try.

Good point. Actually even with a background thread we must
use file-names, because otherwise there's no control over
the number of open file handles.

In addition, my tests on XP indicated that this way many syncs
were no-ops - i.e. close() and later open+sync+close was
faster than flush() and later sync+close.

On both XP and Linux, a background thread was faster than
a sync-at-end.

Some numbers (no-sync, immediate-sync, at-end, background):
100 files of 10K:
  Linux: 5.7, 5.8, 6.4, 5.9
  XP:    6.6, 11.1, 7.7, 6.8
1,000 files of 1K:
  Linux: 5.8, 13.8, 11.2, 6.0
  XP:    8.1, 44.5, 19.2, 15.0
10,000 files of 100 chars:
  Linux: 7.0, 89.9, 68.0, 60.3

So, as much as I am not happy about adding a thread, it seems
to be faster, at least for this synthetic test. I'm curious to
see Mike's actual Lucene numbers.

In any case we should not sync files saved during non-commit writes.
These are most of the writes for large indexes with AutoCommit=false.

Doron
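
A minimal sketch of the queue-the-names, sync-in-background idea
(hypothetical code, not the actual LUCENE-1044 patch):

{code}
import java.io.RandomAccessFile;
import java.util.LinkedList;

class BackgroundSyncer extends Thread {
  private final LinkedList pending = new LinkedList();
  private boolean closed = false;

  // The writer closes the file as usual and only queues its name,
  // so the number of open handles stays bounded.
  synchronized void enqueue(String fileName) {
    pending.add(fileName);
    notify();
  }

  synchronized void finish() {
    closed = true;
    notify();
  }

  public void run() {
    while (true) {
      String name;
      synchronized (this) {
        while (pending.isEmpty() && !closed) {
          try { wait(); } catch (InterruptedException e) { return; }
        }
        if (pending.isEmpty()) return;   // closed and drained
        name = (String) pending.removeFirst();
      }
      try {
        // Reopen, sync, close: by now the OS has often flushed the
        // file already, so the sync is close to a no-op.
        RandomAccessFile f = new RandomAccessFile(name, "rw");
        try {
          f.getFD().sync();
        } finally {
          f.close();
        }
      } catch (Exception e) {
        // a real implementation would report or retry here
      }
    }
  }
}
{code}

The writer never blocks on the sync itself, and by the time the thread
reopens a file the OS has often already flushed it, which matches the
open+sync+close numbers above.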


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]