Re: Sort difference between 2.1 and 2.3

2008-04-08 Thread Michael McCandless
You're right, Lucene changed wrt the 0xffff character: 2.3 now uses
this character internally as an end-of-term marker when storing term
text.

This was done as part of LUCENE-843 (speeding up indexing).

Technically that character is an invalid UTF-16 character (for
interchange), but it looks like a few Lucene users were indeed relying
on older Lucene versions accepting & preserving it.

You could use 0xfffe instead?  Lucene 2.3 will preserve it, though
it's also invalid for interchange (so future Lucene versions might
change wrt that, too).

Or ... it looks like your use case is to sort all "last" values
after all "first" values?  In which case one way to do this (without
using invalid UTF-16 characters) might be to add a new field marking
whether you have a "last" or a "first" value, then sort first by that
field and second by your value field?
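
A rough, untested sketch of that two-field approach against the 2.3
API (the "group" field name and the isLast flag are illustrative, not
from the original mail):

  // mark each document with the group it belongs to
  doc.add(new Field("group", isLast ? "1" : "0",
                    Field.Store.NO, Field.Index.NO_NORMS));

  // sort by the marker first, then by the real value
  Sort sort = new Sort(new SortField[] {
      new SortField("group"),     // "0" (first) sorts before "1" (last)
      new SortField("subject")    // then by the value field itself
  });
  Hits hits = searcher.search(new MatchAllDocsQuery(), sort);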

Mike

Antony Bowesman [EMAIL PROTECTED] wrote:
 Hi,

  I had a test case that added two documents, each with one untokenized
 field, and sorted them.  The data in each document was

  char(1) + "First"
  char(0xffff) + "Last"

  With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1,
 they are not.  Looking at the index with Luke shows that the document with
 "Last" has not been handled correctly, i.e. the text for the subject field
 is empty.

  The test case below shows the problem.

  Regards
  Antony


  import static org.junit.Assert.assertEquals;
  import static org.junit.Assert.assertTrue;

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MatchAllDocsQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;
  import org.junit.After;
  import org.junit.Before;
  import org.junit.Test;

  public class LastSubjectTest
  {
 /**
  *  Set up a number of documents with 1 duplicate ContentId
  *  @throws Exception
  */
 @Before
 public void setUp() throws Exception
 {
 IndexWriter writer = new IndexWriter("TestDir/", new
 StandardAnalyzer(), true);
 Document doc = new Document();
 String subject = new StringBuffer(1).append((char)0xffff).toString()
 + "Last";
 Field f = new Field("subject", subject, Field.Store.YES,
 Field.Index.NO_NORMS);
 doc.add(f);
 writer.addDocument(doc);
 doc = new Document();
 subject = new StringBuffer(1).append((char)0x1).toString() +
 "First";
 f = new Field("subject", subject, Field.Store.YES,
 Field.Index.NO_NORMS);
 doc.add(f);
 writer.addDocument(doc);
 writer.close();
 }

 /**
  *  @throws Exception
  */
 @After
 public void tearDown() throws Exception
 {
 }

 /**
  *  Tests that the last is after first document, sorted by subject
  *  @throws IOException
  */
 @Test
 public void testSortDateAscending()
throws IOException
 {
 IndexSearcher searcher = new IndexSearcher("TestDir/");
 Query q = new MatchAllDocsQuery();
 Sort sort = new Sort(new SortField("subject"));
 Hits hits = searcher.search(q, sort);
 assertEquals("Hits should match all documents",
 searcher.getIndexReader().maxDoc(), hits.length());

 Document fd = hits.doc(0);
 Document ld = hits.doc(1);
 String fs = fd.get("subject");
 String ls = ld.get("subject");

 for (int i = 0; i < hits.length(); i++)
 {
 Document doc = hits.doc(i);
 String subject = doc.get("subject");
 System.out.println("Subject: " + subject);
 }
 assertTrue("Subjects have been sorted incorrectly", fs.compareTo(ls)
  < 0);
 }

  }





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Michael McCandless


Unfortunately, we lost the StandardTokenizerConstants interface as
part of this:

https://issues.apache.org/jira/browse/LUCENE-966

which was a speedup to StandardTokenizer by switching to JFlex instead
of JavaCC.

But, the constants that are used by StandardTokenizer are still
available as static ints in the StandardTokenizer class (ie, ALPHANUM,
APOSTROPHE, etc.).  Does that work?

Mike

Antony Bowesman wrote:
I'm migrating from 2.1 to 2.3 and found that the public interface  
StandardTokenizerConstants has gone.  It looks like the definitions  
have disappeared inside the package private class  
StandardTokenizerImpl.


Was this intentional?  I was using these to determine the return  
values from Token.type().


Antony






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Sort difference between 2.1 and 2.3

2008-04-08 Thread Antony Bowesman
Thanks for the explanation Mike.  It's not a big issue; it's just a test case 
where I needed to ensure ordering for the test, so I'll just use a valid 
high UTF-16 character.  It just seemed odd that the field was showing strangely 
in Luke.  Your explanation gives the reason, thanks.


Antony








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Pooling of posting objects in DocumentsWriter

2008-04-08 Thread Michael Busch

Hi,

this is most likely a question for Mike. I'm trying to figure out what 
changes we need to make in order to support flexible indexing and 
LUCENE-1231. Currently I'm looking into the DocumentsWriter.


If we want to support different posting lists, then we probably want to 
change the Posting class to be an abstract base class and have different 
subclasses that implement the different posting formats.
The DocumentsWriter does pooling of the Posting instances and I'm 
wondering how much this improves performance. It will be a bit harder to 
do pooling with different Posting implementations. Probably we would 
need a Map<Class, PostingPool> with one entry for each Posting subclass 
used?
I wonder if that's worth it, because I was thinking that pooling of 
small objects in modern JVMs is not really more efficient anymore?
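
A hypothetical sketch of that per-subclass pool registry (none of these
classes exist in Lucene as such; the names are made up just to illustrate
the Map<Class, PostingPool> idea):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.Map;

  abstract class Posting {          // stand-in for the abstract base class discussed above
      char[] termText;
  }

  class PostingPool {
      private final ArrayList<Posting> free = new ArrayList<Posting>();

      Posting get(Class<? extends Posting> clazz) throws Exception {
          // reuse a pooled instance if one is available, otherwise allocate a new one
          return free.isEmpty() ? clazz.newInstance() : free.remove(free.size() - 1);
      }

      void release(Posting p) {
          free.add(p);
      }
  }

  class PostingPools {
      // one pool per concrete Posting subclass, as suggested above
      private final Map<Class<? extends Posting>, PostingPool> pools =
              new HashMap<Class<? extends Posting>, PostingPool>();

      PostingPool poolFor(Class<? extends Posting> clazz) {
          PostingPool pool = pools.get(clazz);
          if (pool == null) {
              pool = new PostingPool();
              pools.put(clazz, pool);
          }
          return pool;
      }
  }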


-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1261) Impossible to use custom norm encoding/decoding

2008-04-08 Thread John Adams (JIRA)
Impossible to use custom norm encoding/decoding
---

 Key: LUCENE-1261
 URL: https://issues.apache.org/jira/browse/LUCENE-1261
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.3.1
 Environment: All
Reporter: John Adams


Although it is possible to override methods encodeNorm and decodeNorm in a 
custom Similarity class, these methods are not actually used by the query 
processing and scoring functions, nor by the indexing functions. The relevant 
Lucene classes all call Similarity.decodeNorm rather than 
similarity.decodeNorm, i.e. the norm encoding/decoding is fixed to use that 
of the base Similarity class. Also, index writing classes such as DocumentWriter 
use Similarity.decodeNorm rather than similarity.decodeNorm, so we are 
stuck with the 3-bit mantissa encoding implemented by SmallFloat.floatToByte315 
and SmallFloat.byte315ToFloat.

This is very restrictive and annoying, since in practice many users would 
prefer an encoding that allows finer distinctions for boost and normalisation 
factors close to 1.0. For example, SmallFloat.floatToByte52 uses 5 bits of 
mantissa, and this would be of great help in distinguishing much better between 
subtly different lengthNorms and FieldBoost/DocumentBoost values.

It should be easy to fix this by changing all instances of 
Similarity.decodeNorm and Similarity.encodeNorm to similarity.decodeNorm 
and similarity.encodeNorm in the Lucene code (there are only a few of each).
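
As a quick illustration of the difference in resolution (a standalone
sketch, not part of the report; byte52ToFloat is assumed to be the decoding
counterpart of floatToByte52 in org.apache.lucene.util.SmallFloat):

  import org.apache.lucene.util.SmallFloat;

  public class NormCodecDemo {
      public static void main(String[] args) {
          float[] boosts = { 0.95f, 1.0f, 1.05f, 1.1f };
          for (float b : boosts) {
              // default 3-bit mantissa codec currently hard-wired for norms
              byte b315 = SmallFloat.floatToByte315(b);
              // 5-bit mantissa codec the reporter would like to plug in
              byte b52 = SmallFloat.floatToByte52(b);
              System.out.println(b + " -> byte315 " + SmallFloat.byte315ToFloat(b315)
                  + ", byte52 " + SmallFloat.byte52ToFloat(b52));
          }
      }
  }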

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1261) Impossible to use custom norm encoding/decoding

2008-04-08 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12586880#action_12586880
 ] 

Karl Wettin commented on LUCENE-1261:
-

Hi John,

see LUCENE-1260

karl


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Pooling of posting objects in DocumentsWriter

2008-04-08 Thread Michael McCandless

Hi Michael,

I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.

I agree we would have an abstract base Posting class that just tracks
the term text.

Then, DocumentsWriter manages inverting each field, maintaining the
per-field hash of term text -> abstract Posting instances, exposing
the methods to write bytes into multiple streams for a Posting in the
RAM byte slices, and then read them back when flushing, etc.

And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.  For example,
frq/prx postings and term vectors writing would be two plugins to the
inverted terms API; it's just that term vectors flush after every
document and frq/prx flush when RAM is full.

Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.

There are still a lot of details to work out...

The DocumentsWriter does pooling of the Posting instances and I'm  
wondering how much this improves performance.


We should retest this.  I think it was a decent difference in
performance but I don't remember how much.  I think the pooling can
also be made generic (handled by DocumentsWriter).  E.g. the plugin
could expose a newPosting() method.
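
Roughly like this, perhaps (hypothetical interfaces sketching the
generic-pooling idea; not existing Lucene code):

  abstract class Posting {             // base class tracking just the term text
      char[] termText;
  }

  interface PostingPlugin {
      Posting newPosting();            // DocumentsWriter calls this only when its pool is empty
      void writePosting(Posting p);    // plugin serializes the posting in its own format
  }

  class PostingRecycler {
      private final PostingPlugin plugin;
      private final java.util.ArrayList<Posting> pool = new java.util.ArrayList<Posting>();

      PostingRecycler(PostingPlugin plugin) { this.plugin = plugin; }

      Posting obtain() {
          // generic pooling lives in DocumentsWriter; the plugin only creates instances
          return pool.isEmpty() ? plugin.newPosting() : pool.remove(pool.size() - 1);
      }

      void recycle(Posting p) { pool.add(p); }
  }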

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: shingles and punctuations

2008-04-08 Thread Mathieu Lecarme

setting a flag in a filter is easy :

8---

package org.apache.lucene.analysis.shingle;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * @author Mathieu Lecarme
 *
 */
public class SentenceCutterFilter extends TokenFilter{
  public static final int FLAG = 42;
  public Token previous = null;

  protected SentenceCutterFilter(TokenStream input) {
super(input);
  }

  public Token next() throws IOException {
Token current = input.next();
if(current == null)
  return null;
    if(previous == null || (current.startOffset() -
        previous.endOffset()) > 1)
      current.setFlags(FLAG);
previous = current;
return current;
  }
}

8---
and using it at the right place is tricky :
8---

String test = "This is a test, a big test";
TokenStream stream =
  new StopFilter(
    new ShingleFilter(
      new SentenceCutterFilter(
        new LowerCaseFilter(
          new ISOLatin1AccentFilter(
            new StandardTokenizer(new StringReader(test))))), 3),
    new String[]{"is", "a"});

8---

But I must be too tired; I can't patch the ShingleFilter to handle  
the flag.

I guess the flag should be a bit, tested with a mask.
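
For what it's worth, a sketch of that bit-mask style (the particular bit
chosen is arbitrary; "current" and "token" are the Token variables from
the filter above):

8---
  public static final int SENTENCE_START_FLAG = 1 << 3;

  // set the bit without clobbering flags set by earlier filters
  current.setFlags(current.getFlags() | SENTENCE_START_FLAG);

  // test it later, e.g. inside a patched ShingleFilter
  boolean sentenceStart = (token.getFlags() & SENTENCE_START_FLAG) != 0;
8---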

M.



Le 6 avr. 08 à 22:53, Grant Ingersoll a écrit :
For now, it's up to your app to know, unfortunately :-(  I think the  
WikipediaTokenizer is the only one using flags currently in the  
Lucene.



On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote:

I'll use Token flags to specify the first token in a sentence, but how  
does it work?  How is flag collision avoided?  To keep it simple, I'll  
take 1 as the flag, but what happens if another filter uses the same  
flag?


M.

Le 6 avr. 08 à 20:13, Grant Ingersoll a écrit :
I think you need sentence detection to take place further  
upstream.  Then you could use the Token type or Token flags to  
indicate punctuation, sentences, whatever and we could patch the  
shingle filter to ignore these things, or break and move onto the  
next one.


-Grant

On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:

The new ShingleFilter is very helpful for fetching groups of words,  
but it doesn't handle punctuation or any separation.
If you feed it multiple sentences, you will get shingles that  
start in one sentence and end in the next.
In order to avoid that, you can look at token positions: if there  
is a gap of more than one char from the previous token, it is probably  
punctuation (or a typo).

Any suggestions to handle only shingle in the same sentence?

M.






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12586954#action_12586954
 ] 

Hoss Man commented on LUCENE-1260:
--

bq. I haven't thought too much about it yet, but it seems to me that norm codec 
has more to do with the physical store (Directory) than Similarity and should 
perhaps be moved there instead?

As long as the norm remains a fixed size (1 byte) then it doesn't really matter 
whether it's tied to the Similarity or to the store itself -- it would be nice if 
the Index could tell you which normDecoder to use, but it's not any more 
unreasonable to expect the application to keep track of this (if it's not the 
default encoding) since applications already have to keep track of things like 
which Analyzer is compatible with querying this index.

If we want norms to be more flexible, so that apps can pick not only the 
encoding but also the size... then things get more interesting, but it's still 
feasible to say if you customize this, you have to make your reading apps and 
your writing apps smart enough to know about your customization.

bq. I also want to move it to the instance scope so I can have multiple indices 
with unique norm span/resolutions created from the same classloader.

I agree, it's a good goal.


 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Antony Bowesman

But, the constants that are used by StandardTokenizer are still
available as static ints in the StandardTokenizer class (ie, ALPHANUM,
APOSTROPHE, etc.).  Does that work?


The problem, as mentioned below, is that StandardTokenizerImpl.java is package 
private, and even though the ints and string array are declared as public static, 
they are not visible.


Antony




Mike

Antony Bowesman wrote:
I'm migrating from 2.1 to 2.3 and found that the public interface 
StandardTokenizerConstants has gone.  It looks like the definitions 
have disappeared inside the package private class StandardTokenizerImpl.


Was this intentional?  I was using these to determine the returns 
values from Token.type().


Antony







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Michael McCandless


But, StandardTokenizer is public?  It exports those constants for you?

Mike

Antony Bowesman wrote:

But, the constants that are used by StandardTokenizer are still
available as static ints in the StandardTokenizer class (ie,  
ALPHANUM,

APOSTROPHE, etc.).  Does that work?


Problem as mentioned below is that the StandardTokenizerImpl.java  
is package private and even though the ints and string array are  
declared as public static, they are not visible.


Antony



Mike
Antony Bowesman wrote:
I'm migrating from 2.1 to 2.3 and found that the public interface  
StandardTokenizerConstants has gone.  It looks like the  
definitions have disappeared inside the package private class  
StandardTokenizerImpl.


Was this intentional?  I was using these to determine the returns  
values from Token.type().


Antony


 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Antony Bowesman

But, StandardTokenizer is public?  It exports those constants for you?


Really?  Sorry, but I can't find them - in 2.3.1 sources, there are no 
references to those statics.  Javadocs have no reference to them in 
StandardTokenizer


http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

and I can't see ALPHANUM in the Javadoc index.  Eclipse cannot resolve them.

Am I missing something?
Antony




Mike

Antony Bowesman wrote:

But, the constants that are used by StandardTokenizer are still
available as static ints in the StandardTokenizer class (ie, ALPHANUM,
APOSTROPHE, etc.).  Does that work?


Problem as mentioned below is that the StandardTokenizerImpl.java is 
package private and even though the ints and string array are declared 
as public static, they are not visible.


Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Optimise Indexing time using lucene..

2008-04-08 Thread lucene4varma

Hi all,

I am new to Lucene and am using it for text search in my web application,
and for that I need to index records from a database.
We are using a JDBC directory to store the indexes. The problem is that when
I start indexing the records for the first time, it takes a
huge amount of time. Following is the code for indexing.

rs = st.executeQuery(); // returns 2 million records
while (rs.next()) {
    // create a Java object from the row ...
    // index the record into the JDBC directory ...
}

The above process takes a huge amount of time for 2 million records;
approximately 3-4 business days to run.
Can anyone please suggest an approach by which I could cut down this
time?
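
For reference, the usual shape of a faster bulk-indexing loop against the
2.3 API (a sketch under assumptions: writing to a local filesystem path
instead of the JDBC directory, and made-up "id"/"text" column and field
names):

  import java.sql.ResultSet;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  void bulkIndex(ResultSet rs) throws Exception {
      IndexWriter writer = new IndexWriter("/tmp/bulk-index", new StandardAnalyzer(), true);
      writer.setRAMBufferSizeMB(64);   // flush by RAM usage rather than per document
      while (rs.next()) {
          Document doc = new Document();
          doc.add(new Field("id", rs.getString("id"),
                  Field.Store.YES, Field.Index.UN_TOKENIZED));
          doc.add(new Field("text", rs.getString("text"),
                  Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);     // keep one writer open for the whole run
      }
      writer.optimize();               // single merge at the end
      writer.close();
  }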

Thanks in advance,
Varma
-- 
View this message in context: 
http://www.nabble.com/Optimise-Indexing-time-using-lucene..-tp16575115p16575115.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-08 Thread robert engels

That is the opposite of my testing:...

The 'foreach' is consistently faster. The time difference is  
independent of the size of the array. From what I know about JVM  
implementations, the foreach version SHOULD always be faster -  
because no bounds checking needs to be done on the element access...


Times for the client JVM under 1.5_13

N = 10
indexed time = 14
foreach time = 8
N = 100
indexed time = 90
foreach time = 75
N = 1000
indexed time = 875
foreach time = 732
N = 10000
indexed time = 8801
foreach time = 7552
N = 100000
indexed time = 88566
foreach time = 75974

Times for the server JVM under 1.5_13

N = 10
indexed time = 21
foreach time = 21
N = 100
indexed time = 85
foreach time = 32
N = 1000
indexed time = 347
foreach time = 303
N = 10000
indexed time = 3472
foreach time = 3017
N = 100000
indexed time = 34158
foreach time = 30133

package test;

import junit.framework.TestCase;

public class LoopTest extends TestCase {
    public void testLoops() {

        int I = 10;
        int N = 10;

        for (int factor = 0; factor < 5; factor++) {
            String[] strings = new String[N];

            for (int i = 0; i < N; i++) {
                strings[i] = "some string";
            }

            System.out.println("N = " + N);

            long len = 0;
            long start = System.currentTimeMillis();

            for (int i = 0; i < I; i++) {
                for (int j = 0; j < N; j++) {
                    len += strings[j].length();
                }
            }

            System.out.println("indexed time = " + (System.currentTimeMillis() - start));

            len = 0;
            start = System.currentTimeMillis();
            for (int i = 0; i < I; i++) {
                for (String s : strings) {
                    len += s.length();
                }
            }
            System.out.println("foreach time = " + (System.currentTimeMillis() - start));

            N *= 10;
        }
    }

}



Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-08 Thread Yonik Seeley
On Tue, Apr 8, 2008 at 7:48 PM, robert engels [EMAIL PROTECTED] wrote:
 That is opposite of my testing:...

  The 'foreach' is consistently faster.

It's consistently slower for me (I tested java5 and java6 both with
-server on a P4).
I'm a big fan of testing different methods in different test runs
(because of hotspot, gc, etc).

Example results:
$ c:/opt/jdk16/bin/java -server t 1 10 foreach
N = 10
method=foreachlen=10 indexed time = 8734

[EMAIL PROTECTED] /cygdrive/h/tmp
$ c:/opt/jdk16/bin/java -server t 1 10 iter
N = 10
method=iterlen=10 indexed time = 7062


Here's my test code (a modified version of yours):

public class t {
    public static void main(String[] args) {
        int I = Integer.parseInt(args[0]); // 100
        int N = Integer.parseInt(args[1]); // 10
        String method = args[2].intern();  // "foreach" or "iter"

        String[] strings = new String[N];

        for (int i = 0; i < N; i++) {
            strings[i] = Integer.toString(i);
        }

        System.out.println("N = " + N);

        long len = 0;
        long start = System.currentTimeMillis();

        if (method == "foreach")
            for (int i = 0; i < I; i++) {
                for (String s : strings) {
                    len += s.length();
                }
            }
        else
            for (int i = 0; i < I; i++) {
                for (int j = 0; j < N; j++) {
                    len += strings[j].length();
                }
            }

        System.out.println("method=" + method + " len=" + len + " indexed
time = " + (System.currentTimeMillis() - start));
    }
}

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index

2008-04-08 Thread Trejkaz (JIRA)
NullPointerException from FieldsReader after problem reading the index
--

 Key: LUCENE-1262
 URL: https://issues.apache.org/jira/browse/LUCENE-1262
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.1
Reporter: Trejkaz


There is a situation where there is an IOException reading from Hits, and then 
the next time you get a NullPointerException instead of an IOException.

Example stack traces:

java.io.IOException: The specified network name is no longer available
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
at 
org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
at 
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
at 
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
at org.apache.lucene.search.Hits.doc(Hits.java:104)

That error is fine.  The problem is the next call to doc generates:

java.lang.NullPointerException
at 
org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
at 
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
at org.apache.lucene.search.Hits.doc(Hits.java:104)

Presumably FieldsReader is caching partially-initialised data somewhere.  I 
would normally expect the exact same IOException to be thrown for subsequent 
calls to the method.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-08 Thread Yonik Seeley
foreach vs explicit loop counter is pretty academic for Lucene anyway I think.
I can't think of any inner loops where it would really matter.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Chris Hostetter

:  But, StandardTokenizer is public?  It exports those constants for you?
: 
: Really?  Sorry, but I can't find them - in 2.3.1 sources, there are no
: references to those statics.  Javadocs have no reference to them in
: StandardTokenizer

I think Michael is forgetting that he re-added those constants to the 
trunk after 2.3.1 was released...

https://issues.apache.org/jira/browse/LUCENE-1150
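
So on trunk (after LUCENE-1150) the intended usage looks roughly like
this; on 2.3.1 the constants simply aren't reachable, which is what this
thread keeps running into (sketch only, assuming the TOKEN_TYPES array
and ALPHANUM constant exposed by that issue):

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.standard.StandardTokenizer;

  boolean isAlphanum(Token token) {
      return StandardTokenizer.TOKEN_TYPES[StandardTokenizer.ALPHANUM]
              .equals(token.type());
  }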



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help migrating from 1.9.1 to 2.3.0 (Newbie)

2008-04-08 Thread Chris Hostetter

There is a FAQ covering this question...

http://wiki.apache.org/lucene-java/LuceneFAQ#head-86d479476c63a2579e867b75d4faa9664ef6cf4d

start by getting your code to compile against 1.9.1 without any 
deprecation warnings.  The deprecation messages in the 1.9.1 javadocs will 
tell you which new method to use.

once you have no deprecation warnings with 1.9.1, your code *should* 
compile with 2.3.X just fine.

Incidently, please note the following information if you have followup 
questions on this topic...

Please Use [EMAIL PROTECTED] Not [EMAIL PROTECTED]

Your question is better suited for the [EMAIL PROTECTED] mailing list ...
not the [EMAIL PROTECTED] list.  java-dev is for discussing development of
the internals of the Lucene Java library ... it is *not* the appropriate
place to ask questions about how to use the Lucene Java library when
developing your own applications.  Please resend your message to
the java-user mailing list, where you are likely to get more/better
responses since that list also has a larger number of subscribers.

http://people.apache.org/~hossman/#java-dev



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]