Hudson build is back to normal: Lucene-trunk #428

2008-04-07 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/428/changes






RE: shingles and punctuation

2008-04-07 Thread Steven A Rowe
Hi Mathieu,

From the class comment for ShingleFilter:

  This filter handles position increments > 1 by inserting
  filler tokens (tokens with termtext "_"). It does not
  handle a position increment of 0.

You could use this feature by setting (in an upstream filter) the 
positionIncrement of each sentence-starting word to be at least as large as the 
maximum shingle size.  This would result in sentence-ending shingles like ". _" 
and sentence-beginning shingles like "_ Word".
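
A minimal sketch of such an upstream filter, assuming the 2.3-era Token API and 
a tokenizer that keeps sentence-final punctuation as tokens (the class name and 
the punctuation check are made up):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Bumps the position increment of each sentence-starting token so that a
 * downstream ShingleFilter (max shingle size n) never emits a shingle
 * spanning two sentences; the gap gets padded with "_" filler tokens.
 */
public class SentenceGapFilter extends TokenFilter {
    private final int maxShingleSize;
    private boolean lastWasSentenceEnd = false;

    public SentenceGapFilter(TokenStream input, int maxShingleSize) {
        super(input);
        this.maxShingleSize = maxShingleSize;
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) return null;
        if (lastWasSentenceEnd) {
            // Open a gap at least as wide as the largest shingle.
            token.setPositionIncrement(maxShingleSize);
        }
        String term = token.termText();
        lastWasSentenceEnd = term.equals(".") || term.equals("!") || term.equals("?");
        return token;
    }
}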

Steve

On 04/06/2008 at 1:23 PM, Mathieu Lecarme wrote:
 The new ShingleFilter is very helpful for fetching groups of words, but
 it doesn't handle punctuation or any other separator.
 If you feed it multiple sentences, you will get shingles that
 start in one sentence and end in the next.
 To avoid that, you can look at token positions: if there is a gap of
 more than one character from the previous token, there was probably
 punctuation (or a typo) in between.
 Any suggestions for building shingles only within the same sentence?
 
 M.
 



[jira] Updated: (LUCENE-1259) Token.clone() copies termBuffer - unnecessary in most cases

2008-04-07 Thread Thomas Peuss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Peuss updated LUCENE-1259:
-

Attachment: LUCENE-1259.patch

 Token.clone() copies termBuffer - unnecessary in most cases
 

 Key: LUCENE-1259
 URL: https://issues.apache.org/jira/browse/LUCENE-1259
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Thomas Peuss
Priority: Minor
 Attachments: LUCENE-1259.patch


 The method Token.clone() copies the termBuffer. This is fine for the 
 _clone()_ method itself (it behaves the way we expect _clone()_ to). But 
 in most cases the termBuffer is overwritten directly after cloning, so the 
 copy is an unnecessary step we can avoid. This patch adds a new method called 
 _cloneWithoutTermBuffer()_.
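
As a rough sketch, the proposed method might look something like this inside 
Token (hypothetical; the actual LUCENE-1259.patch may differ):

public Token cloneWithoutTermBuffer() {
    // Copy offsets, type, and position increment, but skip the
    // (potentially large) termBuffer: callers are expected to set
    // the buffer themselves immediately after cloning.
    Token clone = new Token(startOffset, endOffset, type);
    clone.setPositionIncrement(getPositionIncrement());
    return clone;
}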




[jira] Created: (LUCENE-1259) Token.clone() copies termBuffer - unnecessary in most cases

2008-04-07 Thread Thomas Peuss (JIRA)
Token.clone() copies termBuffer - unnecessary in most cases


 Key: LUCENE-1259
 URL: https://issues.apache.org/jira/browse/LUCENE-1259
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Thomas Peuss
Priority: Minor
 Attachments: LUCENE-1259.patch

The method Token.clone() copies the termBuffer. This is fine for the 
_clone()_ method itself (it behaves the way we expect _clone()_ to). But in 
most cases the termBuffer is overwritten directly after cloning, so the copy 
is an unnecessary step we can avoid. This patch adds a new method called 
_cloneWithoutTermBuffer()_.




Dataset to test Lucene

2008-04-07 Thread sumittyagi

hi,
I need a dataset of HTML pages to test my Lucene programs.
Where can I download one?
-- 
View this message in context: 
http://www.nabble.com/Dataset-to-test-lucene-tp16538138p16538138.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Created: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-07 Thread Karl Wettin (JIRA)
Norm codec strategy in Similarity
-

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin


The static span and resolution of the 8-bit norms codec might not fit all 
applications. 

My use case requires that the range 100f-250f be discretized into 60 buckets 
instead of the default (10?).





[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-07 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1260:


Attachment: LUCENE-1260.txt


 * Similarity#getNormCodec()
 * Similarity#setNormCodec(NormCodec)
 * Similarity$NormCodec
 * Similarity$DefaultNormCodec
 * Similarity$SimpleNormCodec (binary-searches over a sorted float[])

I also deprecated Similarity#getNormsTable() and replaced the only use of it I 
could find, in TermScorer. I could not spot any performance problems or other 
issues with that change.
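
As a rough sketch, the strategy might look like this (hypothetical; the 
attached LUCENE-1260.txt may differ in names and details):

/** Pluggable norm codec: Similarity delegates the byte/float conversion. */
public interface NormCodec {
    byte encodeNorm(float f);
    float decodeNorm(byte b);
}

/** The SimpleNormCodec idea: binary-search a sorted float[] of representable values. */
public class SimpleNormCodec implements NormCodec {
    private final float[] values; // sorted, at most 256 entries

    public SimpleNormCodec(float[] sortedValues) {
        this.values = sortedValues;
    }

    public byte encodeNorm(float f) {
        int idx = java.util.Arrays.binarySearch(values, f);
        if (idx < 0) {
            // no exact match: use the insertion point, clamped to the table
            idx = Math.min(-idx - 1, values.length - 1);
        }
        return (byte) idx;
    }

    public float decodeNorm(byte b) {
        return values[b & 0xFF];
    }
}

With 60 values spread over 100f-250f, this would give exactly the kind of 
custom span/resolution described above.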

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8-bit norms codec might not fit 
 all applications. 
 My use case requires that the range 100f-250f be discretized into 60 buckets 
 instead of the default (10?).




[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-07 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12586588#action_12586588
 ] 

Karl Wettin commented on LUCENE-1260:
-

I suppose it would be possible to implement a NormCodec that listens to 
encodeNorm(float) while building a subset of the index, in order to find 
all the norm-resolution sweet spots for that corpus using some appropriate 
algorithm. Mean shift?

Perhaps it would even be possible to compress it down to n buckets from the 
start and then let it grow as new documents with other norm 
requirements are added to the store.

I haven't thought too much about it yet, but it seems to me that the norm codec 
has more to do with the physical store (Directory) than with Similarity, and 
should perhaps be moved there instead. I have no idea how, but I also want to 
move it to instance scope so that I can have multiple indices with unique norm 
spans/resolutions created from the same classloader.
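
A toy illustration of the listening idea, using evenly spaced quantiles as a 
simpler stand-in for mean shift (entirely hypothetical):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Records norms as they are encoded, then derives representable values. */
public class NormObserver {
    private final List<Float> observed = new ArrayList<Float>();

    public void observe(float norm) {
        observed.add(norm);
    }

    /** Builds n representable values at evenly spaced quantiles (n >= 2). */
    public float[] buildValues(int n) {
        List<Float> sorted = new ArrayList<Float>(observed);
        Collections.sort(sorted);
        float[] values = new float[n];
        for (int i = 0; i < n; i++) {
            values[i] = sorted.get(i * (sorted.size() - 1) / (n - 1));
        }
        return values;
    }
}

The resulting float[] could feed something like the SimpleNormCodec sketched 
above.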

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8-bit norms codec might not fit 
 all applications. 
 My use case requires that the range 100f-250f be discretized into 60 buckets 
 instead of the default (10?).




StandardTokenizerConstants in 2.3

2008-04-07 Thread Antony Bowesman
I'm migrating from 2.1 to 2.3 and found that the public interface 
StandardTokenizerConstants is gone.  It looks like the definitions have 
moved inside the package-private class StandardTokenizerImpl.


Was this intentional?  I was using these constants to determine the return 
values of Token.type().
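
One possible workaround in the meantime, assuming the type labels themselves 
are unchanged in 2.3 (they are plain strings such as "<ALPHANUM>" and "<NUM>"), 
is to compare against the strings directly:

Token token = tokenStream.next();
if (token != null && "<NUM>".equals(token.type())) {
    // handle numeric tokens here
}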


Antony





Sort difference between 2.1 and 2.3

2008-04-07 Thread Antony Bowesman

Hi,

I had a test case that added two documents, each with one untokenized field, and 
sorted them.  The data in the two documents was


char(0x1) + "First"
char(0xFFFF) + "Last"

With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1 they 
are not.  Looking at the index with Luke shows that the document with "Last" has 
not been handled correctly, i.e. the text for the subject field is empty.


The test case below shows the problem.

Regards
Antony


import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class LastSubjectTest
{
    /**
     *  Set up two documents, each with a single untokenized subject field
     *  @throws Exception
     */
    @Before
    public void setUp() throws Exception
    {
        IndexWriter writer = new IndexWriter("TestDir/", new StandardAnalyzer(), true);

        Document doc = new Document();
        String subject = new StringBuffer(1).append((char)0xFFFF).toString() + "Last";
        Field f = new Field("subject", subject, Field.Store.YES, Field.Index.NO_NORMS);
        doc.add(f);
        writer.addDocument(doc);

        doc = new Document();
        subject = new StringBuffer(1).append((char)0x1).toString() + "First";
        f = new Field("subject", subject, Field.Store.YES, Field.Index.NO_NORMS);
        doc.add(f);
        writer.addDocument(doc);
        writer.close();
    }

    /**
     *  @throws Exception
     */
    @After
    public void tearDown() throws Exception
    {
    }

    /**
     *  Tests that the "Last" document sorts after the "First" document by subject
     *  @throws IOException
     */
    @Test
    public void testSortDateAscending() throws IOException
    {
        IndexSearcher searcher = new IndexSearcher("TestDir/");
        Query q = new MatchAllDocsQuery();
        Sort sort = new Sort(new SortField("subject"));
        Hits hits = searcher.search(q, sort);
        assertEquals("Hits should match all documents",
                searcher.getIndexReader().maxDoc(), hits.length());

        Document fd = hits.doc(0);
        Document ld = hits.doc(1);
        String fs = fd.get("subject");
        String ls = ld.get("subject");

        for (int i = 0; i < hits.length(); i++)
        {
            Document doc = hits.doc(i);
            String subject = doc.get("subject");
            System.out.println("Subject: " + subject);
        }
        assertTrue("Subjects have been sorted incorrectly", fs.compareTo(ls) < 0);
    }
}





Help migrating from 1.9.1 to 2.3.0 (Newbie)

2008-04-07 Thread JavaJava

The developer who wrote this code has left, and I was asked to upgrade to 
Lucene 2.3.0.  Please comment on these compile errors:

ERROR: rd.delete(t). Does this become rd.deleteDocuments(t)? (rd is 
an IndexReader.)

ERROR: Query q = MultiFieldQueryParser.parse(srch, names, new
StandardAnalyzer())

Do I change this to an array of strings, like so?

Query q = MultiFieldQueryParser.parse(new String[] {srch}, names, new
StandardAnalyzer())

ERROR: Field.Keyword(obj.getKeyFieldName(), obj.getKeyFieldValue())

Both are strings, what does this become?

ERROR: Field.Text(field.getName(), dispVal) 

Both are strings, what does this become?
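
For what it's worth: yes, IndexReader.delete(Term) became deleteDocuments(Term). 
The old Field factories map onto the Field constructor; here is a sketch of the 
likely replacements, reusing the variables from the snippets above (assuming 
you want the old stored/indexed semantics):

// Field.Keyword(name, value) was stored, indexed, and not tokenized:
Field keyword = new Field(obj.getKeyFieldName(), obj.getKeyFieldValue(),
        Field.Store.YES, Field.Index.UN_TOKENIZED);

// Field.Text(name, value) was stored, indexed, and tokenized:
Field text = new Field(field.getName(), dispVal,
        Field.Store.YES, Field.Index.TOKENIZED);

// For the query, constructing the parser sidesteps the String[] form
// (which requires one query string per field):
Query q = new MultiFieldQueryParser(names, new StandardAnalyzer()).parse(srch);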

Thank you.
-- 
View this message in context: 
http://www.nabble.com/Help-migrating-from-1.9.1-to-2.3.0-%28Newbie%29-tp16547527p16547527.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

