Help migrating from 1.9.1 to 2.3.0 (Newbie)
Someone has left here and I was asked to upgrade to Lucene 2.3.0. Please comment on these compile errors:

ERROR: rd.delete(t)
Does this become rd.deleteDocuments(t)? (rd is an IndexReader.)

ERROR: Query q = MultiFieldQueryParser.parse(srch, names, new StandardAnalyzer())
Do I change this to an array of strings, like so:
Query q = MultiFieldQueryParser.parse(new String[] {srch}, names, new StandardAnalyzer()) ?

ERROR: Field.Keyword(obj.getKeyFieldName(), obj.getKeyFieldValue())
Both arguments are strings; what does this become?

ERROR: Field.Text(field.getName(), dispVal)
Both arguments are strings; what does this become?

Thank you.
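A sketch of the usual 2.3 replacements for these four calls (based on the 2.3 javadocs, not compiled against the code in question; the variable names are taken from the post above):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;

// Sketch of the 1.9 -> 2.3 call mappings, one helper per compile error.
class Migration23 {

    // IndexReader.delete(Term) -> deleteDocuments(Term);
    // returns the number of documents deleted.
    static int delete(IndexReader rd, Term t) throws IOException {
        return rd.deleteDocuments(t);
    }

    // For the removed static parse(String, String[], Analyzer): note that
    // the surviving parse(String[], String[], Analyzer) expects one query
    // string per field, so wrapping srch in an array is not equivalent.
    // Use an instance parser for one query string over several fields:
    static Query parse(String srch, String[] names) throws ParseException {
        return new MultiFieldQueryParser(names, new StandardAnalyzer()).parse(srch);
    }

    // Field.Keyword(name, value) -> stored, not analyzed
    static Field keyword(String name, String value) {
        return new Field(name, value, Field.Store.YES, Field.Index.UN_TOKENIZED);
    }

    // Field.Text(name, String value) -> stored, analyzed
    static Field text(String name, String value) {
        return new Field(name, value, Field.Store.YES, Field.Index.TOKENIZED);
    }
}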
Sort difference between 2.1 and 2.3
Hi,

I had a test case that added two documents, each with one untokenized field, and sorted them. The data in the two documents was

  (char)0x1 + "First"
  (char)0xFFFF + "Last"

With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1 they are not. Looking at the index with Luke shows that the document with "Last" has not been handled correctly, i.e. the text for the "subject" field is empty. The test case below shows the problem.

Regards
Antony

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class LastSubjectTest {

    /**
     * Sets up two documents, each with one untokenized "subject" field.
     * @throws Exception
     */
    @Before
    public void setUp() throws Exception {
        IndexWriter writer = new IndexWriter("TestDir/", new StandardAnalyzer(), true);

        Document doc = new Document();
        String subject = new StringBuffer(1).append((char) 0xFFFF).toString() + "Last";
        Field f = new Field("subject", subject, Field.Store.YES, Field.Index.NO_NORMS);
        doc.add(f);
        writer.addDocument(doc);

        doc = new Document();
        subject = new StringBuffer(1).append((char) 0x1).toString() + "First";
        f = new Field("subject", subject, Field.Store.YES, Field.Index.NO_NORMS);
        doc.add(f);
        writer.addDocument(doc);

        writer.close();
    }

    @After
    public void tearDown() throws Exception {
    }

    /**
     * Tests that the "Last" document comes after the "First" document
     * when sorted by subject.
     * @throws IOException
     */
    @Test
    public void testSortDateAscending() throws IOException {
        IndexSearcher searcher = new IndexSearcher("TestDir/");
        Query q = new MatchAllDocsQuery();
        Sort sort = new Sort(new SortField("subject"));
        Hits hits = searcher.search(q, sort);
        assertEquals("Hits should match all documents",
                searcher.getIndexReader().maxDoc(), hits.length());

        Document fd = hits.doc(0);
        Document ld = hits.doc(1);
        String fs = fd.get("subject");
        String ls = ld.get("subject");

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println("Subject:" + doc.get("subject"));
        }

        assertTrue("Subjects have been sorted incorrectly", fs.compareTo(ls) < 0);
    }
}
StandardTokenizerConstants in 2.3
I'm migrating from 2.1 to 2.3 and found that the public interface StandardTokenizerConstants is gone. It looks like the definitions have moved inside the package-private class StandardTokenizerImpl. Was this intentional? I was using these constants to determine the return values from Token.type().

Antony
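If it helps in the meantime, the type strings the tokenizer emits appear unchanged, so they can be compared literally. A workaround sketch (the literal strings are my assumption of the 2.3 values, not a confirmed replacement API):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Workaround sketch: match Token.type() against the literal type names
// that StandardTokenizer emits ("<ALPHANUM>", "<NUM>", "<HOST>", ...).
public class TokenTypeCheck {
    public static void main(String[] args) throws IOException {
        StandardTokenizer ts = new StandardTokenizer(new StringReader("4.2 foo"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            if ("<NUM>".equals(t.type())) {
                System.out.println(t.termText() + " is numeric");
            }
        }
    }
}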
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586588#action_12586588 ]

Karl Wettin commented on LUCENE-1260:
-------------------------------------

I suppose it would be possible to implement a NormCodec that listens to encodeNorm(float) while one is creating a subset of the index, in order to find all norm resolution sweet spots for that corpus using some appropriate algorithm. Mean shift? Perhaps it would even be possible to compress the norms down to n bags from the start and then allow the set to grow in case new documents with other norm requirements are added to the store.

I haven't thought too much about it yet, but it seems to me that the norm codec has more to do with the physical store (Directory) than with Similarity, and should perhaps be moved there instead? I have no idea how, but I also want to move it to instance scope so I can have multiple indices with unique norm spans/resolutions created from the same classloader.
[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1260:
--------------------------------

Attachment: LUCENE-1260.txt

The patch adds:

* Similarity#getNormCodec()
* Similarity#setNormCodec(NormCodec)
* Similarity$NormCodec
* Similarity$DefaultNormCodec
* Similarity$SimpleNormCodec (binary-searches over a sorted float[])

I also deprecated Similarity#getNormsTable() and replaced the only use of it I could find, in TermScorer. I could not spot any problems with performance or anything else from that.
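Since the attachment isn't inlined here, a rough sketch of what a SimpleNormCodec along those lines might look like (the class shape and names are my assumption from the list above, not the actual patch):

import java.util.Arrays;

// Sketch: quantize norms against a caller-supplied sorted table of
// float values; encode() binary-searches the table for the nearest bag.
public class SimpleNormCodec {

    private final float[] table; // sorted ascending, at most 256 entries

    public SimpleNormCodec(float[] sortedValues) {
        if (sortedValues.length > 256) {
            throw new IllegalArgumentException("norms are stored in a single byte");
        }
        this.table = sortedValues;
    }

    public byte encode(float norm) {
        int idx = Arrays.binarySearch(table, norm);
        if (idx < 0) {
            idx = -idx - 1;             // insertion point of the miss
            if (idx == table.length) {
                idx = table.length - 1; // clamp to the largest bag
            }
        }
        return (byte) idx;
    }

    public float decode(byte b) {
        return table[b & 0xFF]; // treat the byte as an unsigned index
    }
}

Discretizing 100f-250f into 60 bags would then just be a matter of passing in a 60-entry table.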
[jira] Created: (LUCENE-1260) Norm codec strategy in Similarity
Norm codec strategy in Similarity
---------------------------------

Key: LUCENE-1260
URL: https://issues.apache.org/jira/browse/LUCENE-1260
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin

The static span and resolution of the 8-bit norms codec might not fit all applications. My use case requires that 100f-250f is discretized into 60 bags instead of the default (10?).
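For context on why the default resolution is coarse in that range (my own illustration, not part of the issue): the stock codec packs each norm into one byte with a 3-bit mantissa, so only a handful of distinct values fall between 100f and 250f. A quick count, assuming the static Similarity.decodeNorm(byte) from 2.3:

import org.apache.lucene.search.Similarity;

// Counts how many distinct norm values the default one-byte codec can
// represent in [100, 250] - i.e. the usable resolution in that range.
public class NormResolution {
    public static void main(String[] args) {
        int distinct = 0;
        for (int b = 0; b < 256; b++) {
            float decoded = Similarity.decodeNorm((byte) b);
            if (decoded >= 100f && decoded <= 250f) {
                distinct++;
            }
        }
        System.out.println(distinct + " distinct norm values in [100f, 250f]");
    }
}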
Dataset to test lucene
Hi, I need a dataset of HTML pages to test my Lucene programs. Where can I download one?
[jira] Created: (LUCENE-1259) Token.clone() copies termBuffer - unnecessary in most cases
Token.clone() copies termBuffer - unnecessary in most cases
-----------------------------------------------------------

Key: LUCENE-1259
URL: https://issues.apache.org/jira/browse/LUCENE-1259
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Thomas Peuss
Priority: Minor
Attachments: LUCENE-1259.patch

The method Token.clone() copies the termBuffer. This is fine for the _clone()_ method itself (it works the way we expect _clone()_ to). But in most cases the termBuffer is overwritten directly after cloning, so the copy is an unnecessary step we can avoid. This patch adds a new method called _cloneWithoutTermBuffer()_.
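The patch itself isn't inlined in this message; a minimal sketch of the idea as described (written against the 2.3 Token API, as a static helper rather than the patch's actual method):

import org.apache.lucene.analysis.Token;

// Sketch: copy a Token's metadata but not its term buffer, for callers
// that overwrite the buffer immediately after cloning anyway.
public final class Tokens {

    public static Token cloneWithoutTermBuffer(Token src) {
        Token t = new Token(src.startOffset(), src.endOffset(), src.type());
        t.setPositionIncrement(src.getPositionIncrement());
        t.setPayload(src.getPayload());
        return t;
    }
}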
[jira] Updated: (LUCENE-1259) Token.clone() copies termBuffer - unnecessary in most cases
[ https://issues.apache.org/jira/browse/LUCENE-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Peuss updated LUCENE-1259:
---------------------------------

Attachment: LUCENE-1259.patch
RE: shingles and punctuation
Hi Mathieu,

From the class comment for ShingleFilter: "This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0."

You could use this feature by setting (in an upstream filter) the positionIncrement of each sentence-starting word to at least the maximum shingle size. This would result in sentence-ending shingles like ". _" and sentence-beginning shingles like "_ Word".

Steve

On 04/06/2008 at 1:23 PM, Mathieu Lecarme wrote:
> The new ShingleFilter is very helpful for fetching groups of words,
> but it doesn't handle punctuation or any other separator.
> If you feed it multiple sentences, you will get shingles that start
> in one sentence and end in the next.
> To avoid that, you can look at token positions: if the gap from the
> previous token is more than one character, it is probably punctuation
> (or a typo).
> Any suggestions for handling only shingles within the same sentence?
>
> M.
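A sketch of such an upstream filter (my own illustration against the 2.3 TokenStream API; the sentence-boundary test, the need to pass in the raw text, and all names here are assumptions, not anything from ShingleFilter itself):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch: give the first token after sentence-ending punctuation a large
// position increment, so a downstream ShingleFilter fills the gap with
// "_" filler tokens instead of building shingles across the boundary.
public class SentenceGapFilter extends TokenFilter {

    private final String text; // the raw input, used to inspect the gaps
    private final int maxShingleSize;
    private int prevEndOffset = -1;

    public SentenceGapFilter(TokenStream input, String text, int maxShingleSize) {
        super(input);
        this.text = text;
        this.maxShingleSize = maxShingleSize;
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        if (prevEndOffset >= 0) {
            // Assumption: a '.', '!' or '?' between the previous token
            // and this one ends a sentence.
            String gap = text.substring(prevEndOffset, t.startOffset());
            if (gap.indexOf('.') >= 0 || gap.indexOf('!') >= 0 || gap.indexOf('?') >= 0) {
                t.setPositionIncrement(maxShingleSize);
            }
        }
        prevEndOffset = t.endOffset();
        return t;
    }
}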