[jira] Resolved: (LUCENE-1371) Add Searcher.search(Query, int)
[ https://issues.apache.org/jira/browse/LUCENE-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1371. Resolution: Fixed > Add Searcher.search(Query, int) > --- > > Key: LUCENE-1371 > URL: https://issues.apache.org/jira/browse/LUCENE-1371 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > > Now that we've deprecated Hits (LUCENE-1290), I think we should add this > trivial convenience method to Searcher, which is just sugar for > Searcher.search(Query, null, int) ie null filter, returning a TopDocs. > This way there is a simple API for users to retrieve the top N results for a > Query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
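The described sugar is tiny; a minimal sketch of such an overload on Searcher, assuming it does nothing beyond delegating to the existing search(Query, Filter, int) with a null filter:
{code}
// Minimal sketch of the convenience overload described above:
// null filter, returning the top n hits as a TopDocs.
public TopDocs search(Query query, int n) throws IOException {
  return search(query, null, n);
}

// Typical call site:
// TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
{code}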
[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar
[ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627943#action_12627943 ] Michael McCandless commented on LUCENE-1126: Hmm -- I'm now seeing an failure with this patch, in TestThaiAnalyzer (in contrib/analyzers): {code} [junit] Testcase: testAnalyzer(org.apache.lucene.analysis.th.TestThaiAnalyzer): FAILED [junit] expected: but was: [junit] junit.framework.ComparisonFailure: expected: but was: [junit] at org.apache.lucene.analysis.th.TestThaiAnalyzer.assertAnalyzesTo(TestThaiAnalyzer.java:43) [junit] at org.apache.lucene.analysis.th.TestThaiAnalyzer.testAnalyzer(TestThaiAnalyzer.java:54) [junit] {code} Does anyone else see this? > Simplify StandardTokenizer JFlex grammar > > > Key: LUCENE-1126 > URL: https://issues.apache.org/jira/browse/LUCENE-1126 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.2 >Reporter: Steven Rowe >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1126.patch > > > Summary of thread entitled "Fullwidth alphanumeric characters, plus a > question on Korean ranges" begun by Daniel Noll on java-user, and carried > over to java-dev: > On 01/07/2008 at 5:06 PM, Daniel Noll wrote: > > I wish the tokeniser could just use Character.isLetter and > > Character.isDigit instead of having to know all the ranges itself, since > > the JRE already has all this information. Character.isLetter does > > return true for CJK characters though, so the ranges would still come in > > handy for determining what kind of letter they are. I don't support > > JFlex has a way to do this... > The DIGIT macro could be replaced by JFlex's predefined character class > [:digit:], which has the same semantics as java.lang.Character.isDigit(). > Although JFlex's predefined character class [:letter:] (same semantics as > java.lang.Character.isLetter()) includes CJK characters, there is a way to > handle this using JFlex's regex negation syntax {{!}}. From [the JFlex > documentation|http://jflex.de/manual.html]: > bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is > !(!{{a}}|{{b}}) > So to exclude CJ characters from the LETTER macro: > {code} > LETTER = ! ( ! [:letter:] | {CJ} ) > {code} > > Since [:letter:] includes all of the Korean ranges, there's no reason > (AFAICT) to treat them separately; unlike Chinese and Japanese characters, > which are individually tokenized, the Korean characters should participate in > the same token boundary rules as all of the other letters. > I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 > supports, and Unicode 5.0, the latest version, and there are lots of new and > modified letter and digit ranges. This stuff gets tweaked all the time, and > I don't think Lucene should be in the business of trying to track it, or take > a position on which Unicode version users' data should conform to. > Switching to using JFlex's [:letter:] and [:digit:] predefined character > classes ties (most of) these decisions to the user's choice of JVM version, > and this seems much more reasonable to me than the current status quo. > I will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1374) Merging of compressed string Fields may hit NPE
Merging of compressed string Fields may hit NPE --- Key: LUCENE-1374 URL: https://issues.apache.org/jira/browse/LUCENE-1374 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.4 This bug was introduced with LUCENE-1219 (only present on 2.4). The bug happens when merging compressed string fields, but only if bulk-merging code does not apply because the FieldInfos for the segment being merged are not congruent. This test shows the bug: {code} public void testMergeCompressedFields() throws IOException { File indexDir = new File(System.getProperty("tempDir"), "mergecompressedfields"); Directory dir = FSDirectory.getDirectory(indexDir); try { for(int i=0;i<5;i++) { // Must make a new writer & doc each time, w/ // different fields, so bulk merge of stored fields // cannot run: IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, IndexWriter.MaxFieldLength.UNLIMITED); w.setMergeFactor(5); w.setMergeScheduler(new SerialMergeScheduler()); Document doc = new Document(); doc.add(new Field("test1", "this is some data that will be compressed this this this", Field.Store.COMPRESS, Field.Index.NO)); doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); doc.add(new Field("field" + i, "random field", Field.Store.NO, Field.Index.TOKENIZED)); w.addDocument(doc); w.close(); } byte[] cmp = new byte[20]; IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } } finally { dir.close(); _TestUtil.rmDir(indexDir); } } {code} It's because in FieldsReader, when we load a field "for merge" we create a FieldForMerge instance which subsequently does not return the right values for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1374: --- Attachment: LUCENE-1374.patch Attached patch that fixes AbstractField's getBinaryValue() and getBinaryLength() methods to fallback to "fieldsData instanceof byte[]" when appropriate. I plan to commit shortly. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
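A rough sketch of the fallback described above, assuming the stored value sits in AbstractField's generic fieldsData slot (an illustration of the idea, not necessarily the committed patch):
{code}
// Sketch: when no dedicated binary buffer was set, fall back to fieldsData
// if it happens to hold the byte[] directly (the FieldForMerge case).
public byte[] getBinaryValue() {
  if (fieldsData instanceof byte[])
    return (byte[]) fieldsData;
  return null;
}

public int getBinaryLength() {
  if (fieldsData instanceof byte[])
    return ((byte[]) fieldsData).length;
  return -1;  // assumption: no binary value available in this sketch
}
{code}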
Re: Moving SweetSpotSimilarity out of contrib
On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving SweetSpotSimilarity out of contrib": > > : >From a legal standpoint, whenever we need to use open-source code, somebody > : has to inspect the code and 'approve' it. This inspection makes sure there's > : no use of 3rd party libraries, to which we'd need to get open-source > : clearance as well. > > You should talk to whomever you need to talk to at your company about > revising the approach you are taking -- the core vs contrib distinction in > Lucene-Java is one of our own making that is completely artificial. With > Lucene 2.4 we could decide to split what is currently known as the "core" > into 27 different directories, none of which are called core, and all of > which have an interdependency on each other. We're not likely to, but we > could -- and then where would your company be? I can't really defend the lawyers (sometimes you get the feeling that they are out to slow you down, rather than help you :( ), but let me try to explain where this sort of thinking comes from, because I think it is actually quite common. Lucene makes the claim that it has the "Apache license", so that any company can (to make a long story short) use this code. But when a company sets out to use Lucene, can it take this claim at face value? After all, what happens if somebody steals some proprietary code and puts it up on the web claiming it has the Apache license - does it give the users of that stolen code any rights? Of course not, because the rights weren't the distributor's to give out in the first place. So it is quite natural that when a company wants to use some open-source code it doesn't take the license at face value, and rather does some "due diligence" to verify that the people who published this code really owned the rights to it. E.g., the company lawyers might want to do some background checks on the committers, look at the project's history (e.g., that it doesn't have some "out of the blue" donations from vague sources), check the code and comments for suspicious strings, patterns, and so on. When you need to inspect the code, naturally you need to decide what you inspect. This particular company chose to inspect only the Lucene core, perhaps because it is smaller, has fewer contributors, and has the vast majority of what most Lucene users need. Inspecting all of contrib - with all its foreign-language analyzers, stuff like gdata and other rarely used stuff - may be too hard for them. But then, the question I would ask is - why not inspect the core *and* the few contribs that interest you? For example, SweetSpotSimilarity (which you need) and other generally useful stuff like Highlighter and SnowballAnalyzer. > Doing this would actually be a complete reversal of the goals discussed in > the near past: increasing our use of the contrib structure for new > features that aren't inherently tied to the "guts" of Lucene. The goal > being to keep the "core" jar as small as possible for people who want to > develop apps with a small footprint. I agree that this is an important goal. > At one point there was even talk of refactoring additional code out of the > core and into a contrib (this was already done with some analyzers when > Lucene became a TLP) -- Nadav Har'El | Wednesday, Sep 3 2008, 3 Elul 5768 | IBM Haifa Research Lab | http://nadav.harel.org.il | Promises are like babies: fun to make, but hell to deliver. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
Thanks all for the "legal" comments. Can we consider moving the SweetSpotSimilarity to "core" because of the quality improvements it introduces to search? I tried to emphasize that that's the main reason, but perhaps I didn't do a good job at that, since the discussion has turned into a legal issue :-). On Wed, Sep 3, 2008 at 3:21 PM, Nadav Har'El <[EMAIL PROTECTED]>wrote: > On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving > SweetSpotSimilarity out of contrib": > > > > : >From a legal standpoint, whenever we need to use open-source code, > somebody > > : has to inspect the code and 'approve' it. This inspection makes sure > there's > > : no use of 3rd party libraries, to which we'd need to get open-source > > : clearance as well. > > > > You should talk to whomever you need to talk to at your company about > > revising the appraoch you are taking -- the core vs contrib distinction > in > > Lucene-Java is one of our own making that is completly artificial. With > > Lucene 2.4 we could decide to split what is currently known as the "core" > > into 27 different directories, none of which are called core, and all of > > which have an interdependency on eachother. We're not likely to, but we > > could -- and then where woud your company be? > > I can't really defend the lawyers (sometimes you get the feeling that they > are out to slow you down, rather than help you :( ), but let me try to > explain > where this sort of thinking comes from, because I think it is actually > quite > common. > > Lucene makes the claim that it has the "apache license", so that any > company > can (to make a long story short) use this code. But when a company sets out > to use Lucene, can it take this claim at face value? After all, what > happens > if somebody steals some proprietary code and puts it up on the web claiming > it > has the apache license - does it give the users of that stolen code any > rights? Of course not, because the rights weren't the distributor's to give > out in the first place. > > So it is quite natural that when a company wants to use use some > open-source > code it doesn't take the license at face value, and rather does some "due > diligance" to verify that the people who published this code really owned > the rights to it. E.g., the company lawyers might want to do some > background > checks on the committers, look at the project's history (e.g., that it > doesn't > have some "out of the blue" donations from vague sources), check the code > and > comments for suspicious strings, patterns, and so on. > > When you need to inspect the code, naturally you need to decide what you > inspect. This particular company chose to inspect only the Lucene core, > perhaps because it is smaller, has fewer contributors, and has the vast > majority of what most Lucene users need. Inspecting all the contrib - with > all its foreign language analyzers, stuff like gdata and other rarely used > stuff - may be too hard for them. But then, the question I would ask is - > why not inspect the core *and* the few contribs that interest you? For > example, SweetSpotSimilarity (which you need) and other generally useful > stuff like Highlighter and SnowballAnalyzer. > > > Doing this would actually be a complete reversal of the goals discussed > in > > the near past: increasing our use of the contrib structure for new > > features that aren't inherently tied to the "guts" of Lucene. 
The goal > > being to keep the "core" jar as small as possible for people who want to > > develop apps with a small foot print. > > I agree that this is an important goal. > > > At one point there was even talk of refactoring additional code out of > the > > core and into a contrib (this was already done with some analyzers when > > Lucene became a TLP) > > -- > Nadav Har'El| Wednesday, Sep 3 2008, 3 Elul > 5768 > IBM Haifa Research Lab > |- >|Promises are like babies: fun to make, > http://nadav.harel.org.il |but hell to deliver. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Moving SweetSpotSimilarity out of contrib
I think it's a fair question that, regardless of the legal mumbo jumbo provoking it, can be considered on the merits, as it should be: is it something important enough to bulk up the core, with the trade-off being that more people will find it helpful and can use it with slightly less hassle? I have seen discussion about core vs contrib before, and from what I saw, the distinction and rules are not quite clear. I would think, though, that if the new Similarity is really that much better than the old, it might actually benefit from being in core. There is no doubt core gets more attention on both the user and developer side, and important pieces with general usage should probably be there. I haven't used it myself, so I won't guess (too much), but the question to me seems to be: is SweetSpot important enough to move to core? Are there enough good reasons? And even if so, is it ready to move to core? Contrib also seems to be somewhat of a possible incubation area... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627990#action_12627990 ] Grant Ingersoll commented on LUCENE-1373: - I think you should mirror what is done in StandardAnalyzer. You probably could create an abstract class that all of them inherit to share the common code. Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one. This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe. > Most of the contributed Analyzers suffer from invalid recognition of acronyms. > -- > > Key: LUCENE-1373 > URL: https://issues.apache.org/jira/browse/LUCENE-1373 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis, contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Mark Lassau >Priority: Minor > > LUCENE-1068 describes a bug in StandardTokenizer whereby a string like > "www.apache.org." would be incorrectly tokenized as an acronym (note the dot > at the end). > Unfortunately, keeping the "backward compatibility" of a bug turns out to > harm us. > StandardTokenizer has a couple of ways to indicate "fix this bug", but > unfortunately the default behaviour is still to be buggy. > Most of the non-English analyzers provided in lucene-analyzers utilize the > StandardTokenizer, and in v2.3.2 not one of these provides a way to get the > non-buggy behaviour :( > I refer to: > * BrazilianAnalyzer > * CzechAnalyzer > * DutchAnalyzer > * FrenchAnalyzer > * GermanAnalyzer > * GreekAnalyzer > * ThaiAnalyzer -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
Not tried SweetSpot so can't comment on worthiness of moving to core but agree with the principle that we can't let the hassles of a company's "due diligence" testing dictate the shape of core vs contrib. For anyone concerned with the overhead of doing these checks a company/product of potential interest is "Black Duck". I don't work for them and don't offer any endorsement but simply point them out as something you might want to take a look at. Cheers Mark - Original Message From: Nadav Har'El <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Wednesday, 3 September, 2008 13:21:34 Subject: Re: Moving SweetSpotSimilarity out of contrib On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving SweetSpotSimilarity out of contrib": > > : >From a legal standpoint, whenever we need to use open-source code, somebody > : has to inspect the code and 'approve' it. This inspection makes sure there's > : no use of 3rd party libraries, to which we'd need to get open-source > : clearance as well. > > You should talk to whomever you need to talk to at your company about > revising the appraoch you are taking -- the core vs contrib distinction in > Lucene-Java is one of our own making that is completly artificial. With > Lucene 2.4 we could decide to split what is currently known as the "core" > into 27 different directories, none of which are called core, and all of > which have an interdependency on eachother. We're not likely to, but we > could -- and then where woud your company be? I can't really defend the lawyers (sometimes you get the feeling that they are out to slow you down, rather than help you :( ), but let me try to explain where this sort of thinking comes from, because I think it is actually quite common. Lucene makes the claim that it has the "apache license", so that any company can (to make a long story short) use this code. But when a company sets out to use Lucene, can it take this claim at face value? After all, what happens if somebody steals some proprietary code and puts it up on the web claiming it has the apache license - does it give the users of that stolen code any rights? Of course not, because the rights weren't the distributor's to give out in the first place. So it is quite natural that when a company wants to use use some open-source code it doesn't take the license at face value, and rather does some "due diligance" to verify that the people who published this code really owned the rights to it. E.g., the company lawyers might want to do some background checks on the committers, look at the project's history (e.g., that it doesn't have some "out of the blue" donations from vague sources), check the code and comments for suspicious strings, patterns, and so on. When you need to inspect the code, naturally you need to decide what you inspect. This particular company chose to inspect only the Lucene core, perhaps because it is smaller, has fewer contributors, and has the vast majority of what most Lucene users need. Inspecting all the contrib - with all its foreign language analyzers, stuff like gdata and other rarely used stuff - may be too hard for them. But then, the question I would ask is - why not inspect the core *and* the few contribs that interest you? For example, SweetSpotSimilarity (which you need) and other generally useful stuff like Highlighter and SnowballAnalyzer. 
> Doing this would actually be a complete reversal of the goals discussed in > the near past: increasing our use of the contrib structure for new > features that aren't inherently tied to the "guts" of Lucene. The goal > being to keep the "core" jar as small as possible for people who want to > develop apps with a small foot print. I agree that this is an important goal. > At one point there was even talk of refactoring additional code out of the > core and into a contrib (this was already done with some analyzers when > Lucene became a TLP) -- Nadav Har'El| Wednesday, Sep 3 2008, 3 Elul 5768 IBM Haifa Research Lab |- |Promises are like babies: fun to make, http://nadav.harel.org.il |but hell to deliver. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1374. Resolution: Fixed Committed revision 691617. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Multi Phrase Search at the Beginning of a field
Excellent, it worked :) Thank you Tori!! Regards, Vinay >-Original Message- >From: ext Andraz Tori [mailto:[EMAIL PROTECTED] >Sent: 01 September, 2008 16:39 >To: java-dev@lucene.apache.org >Subject: Re: Multi Phrase Search at the Beginning of a field > >You can use standard trick. > >Insert a special token at the beginning of every field you are >indexing, and add that special token to beginning of every query. > >Since this token will not occur anywhere else in the field, >you will know that your queries match only beginnings of fields > >bye >andraz > >On Mon, 2008-09-01 at 15:50 +0300, [EMAIL PROTECTED] wrote: >> Hi, >> >> Can some one please help me in providing a solution for my problem: >> >> I have a single field defined in my document. Now I want to do a >> MultiPhraseQuery - but at the beginning of the field. >> >> For e.g: If there are 3 documents with single field ( say 'title' ) >> has the values -> "Hello Love you", "Love You Sister", "Love Yoxyz" >> >> Then my search for "Love yo*" -> MultiPhraseQuery with first term >> "Love" ( using addTerm("Love") and the next terms ( using >> addTerms("Yo*" - after getting all terms 'You' and 'Yoxyz' using >> IndexReader.terms(Yo) ) should return only the documents "Love You >> Sister", "Love Yoxyz" - but not "Hello Love you". >> >> Can some one please help me on how to get it done. >> >> >> Regards, >> Vinay >> >-- >Andraz Tori, CTO >Zemanta Ltd, London, Ljubljana >www.zemanta.com >mail: [EMAIL PROTECTED] >tel: +386 41 515 767 >twitter: andraz, skype: minmax_test > > > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > >
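A sketch of the sentinel-token trick in Lucene 2.x terms (field name, sentinel string and the expanded terms are illustrative, and the example assumes a lowercasing analyzer):
{code}
// Index time: prepend a token that can never occur in real titles.
Document doc = new Document();
doc.add(new Field("title", "_start_ Love You Sister",
                  Field.Store.YES, Field.Index.TOKENIZED));

// Query time: anchor the MultiPhraseQuery with the same sentinel, then the
// literal term, then the terms enumerated from the "yo" prefix.
MultiPhraseQuery q = new MultiPhraseQuery();
q.add(new Term("title", "_start_"));
q.add(new Term("title", "love"));
q.add(new Term[] {
    new Term("title", "you"),
    new Term("title", "yoxyz")
});
{code}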
[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system
[ https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628025#action_12628025 ] Ning Li commented on LUCENE-532: Is the use of seek and write in ChecksumIndexOutput making Lucene less likely to support all sequential write (i.e. no seek write)? ChecksumIndexOutput is currently used by SegmentInfos. > [PATCH] Indexing on Hadoop distributed file system > -- > > Key: LUCENE-532 > URL: https://issues.apache.org/jira/browse/LUCENE-532 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 1.9 >Reporter: Igor Bolotin >Priority: Minor > Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, > TermInfosWriter.patch > > > In my current project we needed a way to create very large Lucene indexes on > Hadoop distributed file system. When we tried to do it directly on DFS using > Nutch FsDirectory class - we immediately found that indexing fails because > DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason > for this behavior is clear - DFS does not support random updates and so > seek() method can't be supported (at least not easily). > > Well, if we can't support random updates - the question is: do we really need > them? Search in the Lucene code revealed 2 places which call > IndexOutput.seek() method: one is in TermInfosWriter and another one in > CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the > only place that concerned us was in TermInfosWriter. > > TermInfosWriter uses IndexOutput.seek() in its close() method to write total > number of terms in the file back into the beginning of the file. It was very > simple to change file format a little bit and write number of terms into last > 8 bytes of the file instead of writing them into beginning of file. The only > other place that should be fixed in order for this to work is in > SegmentTermEnum constructor - to read this piece of information at position = > file length - 8. > > With this format hack - we were able to use FsDirectory to write index > directly to DFS without any problems. Well - we still don't index directly to > DFS for performance reasons, but at least we can build small local indexes > and merge them into the main index on DFS without copying big main index back > and forth. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
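A sketch of the append-only pattern the issue describes, expressed with IndexOutput/IndexInput (illustrative helper methods, not the actual TermInfosWriter/SegmentTermEnum patch):
{code}
// Writer side: instead of seeking back to position 0, append the term
// count as the last 8 bytes of the file.
void finish(IndexOutput out, long termCount) throws IOException {
  out.writeLong(termCount);
  out.close();
}

// Reader side: read the count from (file length - 8), then rewind.
long readTermCount(IndexInput in) throws IOException {
  in.seek(in.length() - 8);
  long count = in.readLong();
  in.seek(0);
  return count;
}
{code}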
Can I filter the results returned by IndexReader.terms(term)?
I am using IndexReader.terms(term) to produce term suggestions to my users as they type. In many cases the user is searching lucene with a filter applied, for example a date range. Is there any way I can get a list of terms in the index that are contained within a subset of the documents by a given filter. i.e. I'd like to do something like ... IndexReader reader = readerProvider.openReader(directoryProvider); reader.filterDocument(filter); TermEnum termEnum = reader.terms(new Term("name", "")); ...Iterate on terms I've scouted all over the API and I cannot find how to do this or if it is possible. Please let me know if it can be done and if so how. Thanks! -- View this message in context: http://www.nabble.com/Can-I-filter-the-results-returned-by-IndexReader.terms%28term%29--tp19292207p19292207.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Can I filter the results returned by IndexReader.terms(term)?
One way is to read TermDocs for each candidate term and see if they are in your filter - but that sounds like a lot of disk IO to me when responding to individual user keystrokes. You can use "skip" to avoid reading all term docs when you know what is in the filter but it all seems a bit costly. It's hard to optimise in advance for this, especially if the filter is an arbitrary choice of documents for each user. - Original Message From: AdrianPillinger <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Wednesday, 3 September, 2008 16:54:11 Subject: Can I filter the results returned by IndexReader.terms(term)? I am using IndexReader.terms(term) to produce term suggestions to my users as they type. In many cases the user is searching lucene with a filter applied, for example a date range. Is there any way I can get a list of terms in the index that are contained within a subset of the documents by a given filter. i.e. I'd like to do something like ... IndexReader reader = readerProvider.openReader(directoryProvider); reader.filterDocument(filter); TermEnum termEnum = reader.terms(new Term("name", "")); ...Iterate on terms I've scouted all over the API and I cannot find how to do this or if it is possible. Please let me know if it can be done and if so how. Thanks! -- View this message in context: http://www.nabble.com/Can-I-filter-the-results-returned-by-IndexReader.terms%28term%29--tp19292207p19292207.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
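A rough sketch of that approach against the 2.x API, assuming the filter has already been materialized as a BitSet of accepted doc ids (reader, prefix and filterBits are placeholders):
{code}
// Keep a prefix term only if at least one of its documents passes the filter.
List suggestions = new ArrayList();
TermEnum terms = reader.terms(new Term("name", prefix));
TermDocs termDocs = reader.termDocs();
try {
  do {
    Term t = terms.term();
    if (t == null || !t.field().equals("name") || !t.text().startsWith(prefix))
      break;
    termDocs.seek(t);
    while (termDocs.next()) {
      if (filterBits.get(termDocs.doc())) {   // filterBits: java.util.BitSet of accepted docs
        suggestions.add(t.text());
        break;                                // one matching doc is enough
      }
    }
  } while (terms.next());
} finally {
  termDocs.close();
  terms.close();
}
{code}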
[jira] Commented: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628055#action_12628055 ] Chris Harris commented on LUCENE-1374: -- "ant test" on 691617 for me fails on the following test: java.io.IOException: could not delete C:\lucene\691647\build\test\mergecompressedfields\_5.cfs at org.apache.lucene.util._TestUtil.rmDir(_TestUtil.java:37) at org.apache.lucene.index.TestIndexWriter.testMergeCompressedFields(TestIndexWriter.java:4111) It might be one of those things that shows up only on Windows. In any case, adding a call to IndexReader.close() in testMergeCompressedFields() seems to fix things up: IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } r.close(); // <--- New line } finally { dir.close(); _TestUtil.rmDir(indexDir); } I guess technically the r.close() probably belongs in a finally block as well. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Issue Comment Edited: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628055#action_12628055 ] ryguasu edited comment on LUCENE-1374 at 9/3/08 10:07 AM: --- "ant test" on 691617 for me fails on the following test: java.io.IOException: could not delete C:\lucene\691647\build\test\mergecompressedfields\_5.cfs at org.apache.lucene.util._TestUtil.rmDir(_TestUtil.java:37) at org.apache.lucene.index.TestIndexWriter.testMergeCompressedFields(TestIndexWriter.java:4111) It might be one of those things that shows up only on Windows. In any case, adding a call to IndexReader.close() in testMergeCompressedFields() seems to fix things up: {code} IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } r.close(); // <--- New line } finally { dir.close(); _TestUtil.rmDir(indexDir); } {code} I guess technically the r.close() probably belongs in a finally block as well. was (Author: ryguasu): "ant test" on 691617 for me fails on the following test: java.io.IOException: could not delete C:\lucene\691647\build\test\mergecompressedfields\_5.cfs at org.apache.lucene.util._TestUtil.rmDir(_TestUtil.java:37) at org.apache.lucene.index.TestIndexWriter.testMergeCompressedFields(TestIndexWriter.java:4111) It might be one of those things that shows up only on Windows. In any case, adding a call to IndexReader.close() in testMergeCompressedFields() seems to fix things up: IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } r.close(); // <--- New line } finally { dir.close(); _TestUtil.rmDir(indexDir); } I guess technically the r.close() probably belongs in a finally block as well. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. 
This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically gen
Re: Can I filter the results returned by IndexReader.terms(term)?
Another way is to use the trunk, where Scorer is a subclass of DocIdSetIterator, which is returned by a Filter. This allows to create a TermFilter that returns a TermScorer (which is based on TermEnum internally.) Try wrapping it in a CachingWrapperFilter when it needs to be reused. Finally, have a look here to see whether it could help in your case: https://issues.apache.org/jira/browse/LUCENE-1296 Regards, Paul Elschot Op Wednesday 03 September 2008 18:00:27 schreef mark harwood: > One way is to read TermDocs for each candidate term and see if they > are in your filter - but that sounds like a lot of disk IO to me when > responding to individual user keystrokes. You can use "skip" to avoid > reading all term docs when you know what is in the filter but it all > seems a bit costly. > > It's hard to optimise in advance for this, especially if the filter > is an arbitrary choice of documents for each user. > > > > - Original Message > From: AdrianPillinger <[EMAIL PROTECTED]> > To: java-dev@lucene.apache.org > Sent: Wednesday, 3 September, 2008 16:54:11 > Subject: Can I filter the results returned by > IndexReader.terms(term)? > > > I am using IndexReader.terms(term) to produce term suggestions to my > users as they type. In many cases the user is searching lucene with a > filter applied, for example a date range. > > Is there any way I can get a list of terms in the index that are > contained within a subset of the documents by a given filter. > > i.e. I'd like to do something like > > ... > IndexReader reader = readerProvider.openReader(directoryProvider); > reader.filterDocument(filter); > TermEnum termEnum = reader.terms(new Term("name", " term>")); ...Iterate on terms > > > > I've scouted all over the API and I cannot find how to do this or if > it is possible. > > Please let me know if it can be done and if so how. > > Thanks! - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
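A small sketch of the trunk-style plumbing described above, using QueryWrapperFilter as a stand-in for a dedicated TermFilter (names are illustrative):
{code}
// Trunk (2.4): a Filter hands back a DocIdSet whose iterator can be walked
// much like a Scorer, and the caching wrapper makes reuse across keystrokes cheap.
Filter filter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("name", "foo"))));
DocIdSetIterator it = filter.getDocIdSet(reader).iterator();
while (it.next()) {
  int docId = it.doc();   // a document accepted by the filter
}
{code}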
[jira] Commented: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628067#action_12628067 ] Michael McCandless commented on LUCENE-1374: Woops, you're right: I too see that failure (to rmDir the directory) only on Windows. I'll commit a fix. Thanks Chris! > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
: saw, the distinction and rules are not quite clear. I would think though, if : the new Similarity is really that much better than the old, it might actually : benefit in core. There is no doubt core gets more attention on both the user : and developer side, and important pieces with general usage should probably : be there. I see a Chicken/Egg argument here ... Perhaps contribs would get more attention if we used them more -- as in: put more stuff in them. : I haven't used it myself, so I won't guess (too much), but the question to : me seems to be, is SweetSpot important enough to move to core? Are there : enough good reasons? And even if so, is it ready to move to core? Contrib also : seems to be somewhat of a possible incubation area... I think that's the wrong question to ask. I would rather ask the question "Is X decoupled enough from Lucene internals that it can be a contrib?" Things like IndexWriter, IndexReader, Document and TokenStream really need to be "core" ... but things like the QueryParser, and most of our analyzers, don't. Having lots of loosely coupled mini-libraries that respect good API boundaries seems more reusable and generally saner than "all of this code is useful and lots of people want it so throw it into the kitchen sink". We don't need to go hog wild gutting things out of the core ... but I don't think we should be adding new things to the core just because they are "generally useful". -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
I would agree with you if I were wrong about the contrib/core attention thing, but I don't think I am. It seems as if you have been arguing that contrib is really just an extension of core, on par with core, but just in different libs, and to keep core lean and mean, anything not needed in core shouldn't be there - sounds like an idea I could get behind, but it seems to ignore the reality: the user/dev focus definitely seems to be on core. Some of contrib is a graveyard in terms of dev and use, I think. I think it's still entangled in its "sandbox" roots. Contrib lacks many requirements of core code - it can be Java 1.5, it doesn't have to be backward compatible, etc. Putting something in core ensures it's treated as a Lucene first-class citizen; stuff in contrib is not held to such strict standards. Even down to the people working on the code, there is a lower bar to become a contrib committer than a full core committer (see my contrib committer status). It's not that I don't like what you propose, but I don't buy it as very viable the way things are now. IMO we would need to do some work to make it a reality. It can be said that's the way it is, but my view of things doesn't jibe with it. I may have miswritten "generally useful". What I meant was, if the sweet spot sim is better than the default sim, but a bit harder to use because of config, perhaps it is "core" enough to go there, as often it may be better to use. Again, I fully believe it would get more attention and be 'better' maintained. I did not mean to set the bar at "generally useful" and I apologize for my imprecise language (one of my many faults). I think that's the wrong question to ask. I would rather ask the question "Is X decoupled enough from Lucene internals that it can be a contrib?" Things like IndexWriter, IndexReader, Document and TokenStream really need to be "core" ... but things like the QueryParser, and most of our analyzers, don't. Having lots of loosely coupled mini-libraries that respect good API boundaries seems more reusable and generally saner than "all of this code is useful and lots of people want it so throw it into the kitchen sink". We don't need to go hog wild gutting things out of the core ... but I don't think we should be adding new things to the core just because they are "generally useful". -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1313) Ocean Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628092#action_12628092 ] Jason Rutherglen commented on LUCENE-1313: -- Is there a good place to place the javadocs on the Apache website once they are more complete? > Ocean Realtime Search > - > > Key: LUCENE-1313 > URL: https://issues.apache.org/jira/browse/LUCENE-1313 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Jason Rutherglen > Attachments: lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, > lucene-1313.patch > > > Provides realtime search using Lucene. Conceptually, updates are divided > into discrete transactions. The transaction is recorded to a transaction log > which is similar to the mysql bin log. Deletes from the transaction are made > to the existing indexes. Document additions are made to an in memory > InstantiatedIndex. The transaction is then complete. After each transaction > TransactionSystem.getSearcher() may be called which allows searching over the > index including the latest transaction. > TransactionSystem is the main class. Methods similar to IndexWriter are > provided for updating. getSearcher returns a Searcher class. > - getSearcher() > - addDocument(Document document) > - addDocument(Document document, Analyzer analyzer) > - updateDocument(Term term, Document document) > - updateDocument(Term term, Document document, Analyzer analyzer) > - deleteDocument(Term term) > - deleteDocument(Query query) > - commitTransaction(List documents, Analyzer analyzer, List > deleteByTerms, List deleteByQueries) > Sample code: > {code} > // setup > FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), > "log"); > LogDirectory logDirectory = directoryMap.getLogDirectory(); > TransactionLog transactionLog = new TransactionLog(logDirectory); > TransactionSystem system = new TransactionSystem(transactionLog, new > SimpleAnalyzer(), directoryMap); > // transaction > Document d = new Document(); > d.add(new Field("contents", "hello world", Field.Store.YES, > Field.Index.TOKENIZED)); > system.addDocument(d); > // search > OceanSearcher searcher = system.getSearcher(); > ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs; > System.out.println(hits.length + " total results"); > for (int i = 0; i < hits.length && i < 10; i++) { > Document d = searcher.doc(hits[i].doc); > System.out.println(i + " " + hits[i].score+ " " + d.get("contents"); > } > {code} > There is a test class org.apache.lucene.ocean.TestSearch that was used for > basic testing. > A sample disk directory structure is as follows: > |/snapshot_105_00.xml | XML file containing which indexes and their > generation numbers correspond to a snapshot. Each transaction creates a new > snapshot file. In this file the 105 is the snapshotid, also known as the > transactionid. The 00 is the minor version of the snapshot corresponding to > a merge. A merge is a minor snapshot version because the data does not > change, only the underlying structure of the index| > |/3 | Directory containing an on disk Lucene index| > |/log | Directory containing log files| > |/log/log0001.bin | Log file. As new log files are created the suffix > number is incremented| -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Moving SweetSpotSimilarity out of contrib
On 09/03/2008 at 2:00 PM, Chris Hostetter wrote: > On 09/03/2008 at 8:40 AM, Mark Miller wrote: > > I havn't used it myself, so I won't guess (too much ), but the > > question to me seems to be, is SweetSpot important enough to move to > > core? Are there enough good reasons? And even if so, is it ready to > > move to core? Contrib also seems to be somewhat of a possible > > incubation area... > > I think that's the wrong question to ask. I would rather ask the > question "Is X decoupled enough from Lucene internals that it can be a > contrib?" Things like IndexWriter, IndexReader, Document and TokenStream > really need to be "core" ... but things like the QueryParser, and most > of our analyzers don't. Having lots of loosely coupled mini-libraries > that respect good API boundaries seems more reusable and generally saner > then "all of this code is useful and lots of people wnat it so throw it > into the kitchen sink" > > We don't need to go hog wild gutting things out of the core ... but i > don't think we should be adding new things to the core just > becuase they are "generally useful". One of core's requirements is: no external dependencies. Although many contrib components meet this requirement, there is no structural differentiation between them and those that don't. So from the point of view of simplifying lawyers' licensing labors :), it might make sense to split off a "contrib-no-ext-deps". Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
Another important driver is the "out-of-the-box experience". It's crucial that Lucene has good starting defaults for everything because many developers will stick with these defaults and won't discover the wiki page that says you need to do X, Y and Z to get better relevance, indexing speed, searching speed, etc. This then makes Lucene look bad, not only to these Lucene users but then also to the end users who use their apps that say "Powered by Lucene". It also affects Lucene's adoption/growth over time: when a potential new user is just "trying Lucene out" we want our defaults to shine because those new users will walk away if Lucene doesn't compare well to other engines that are well-tuned out-of-the-box. I remember a while back we discussed an article comparing performance of various search engines and we were disappointed that the author didn't do X, Y and Z to let Lucene compete fairly. If we had good defaults that wouldn't have happened (or, at least to a lesser extent). Obviously we can't default everything perfectly since at some point there are hard tradeoffs to be made and every app is different, but if SweetSpotSimilarity really gives better relevance for many/most apps, and doesn't have any downsides (I haven't looked closely myself), I think we should get it into core? You know... it's almost like we need a "standard distro" (drawing analogy to Linux) for Lucene, which would be the core plus cherry-pick certain important contrib modules (highlighter, SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them together. See, highlighting is obviously well "decoupled" from Lucene's core, so it should remain in contrib, yet is also cleary a very important function in nearly every search engine. Mike Mark Miller wrote: I would agree with you if I was wrong about the contrib/core attention thing, but I don't think I am. It seems as if you have been arguing that contrib is really just an extension of core, on par with core, but just in different libs, and to keep core lean and mean, anything not needed in core shouldn't be there - sounds like an idea I could get behind, but seems to ignore the reality: The user/dev focus definitely seems to be on core. Some of contrib is a graveyard in terms of dev and use I think. I think its still entangled in its "sandbox" roots. Contrib lacks many requirements of core code - it can be java 1.5, it doesn't have to be backward compatible, etc. Putting something in core ensures its treated as a Lucene first class citizen, stuff in contrib is not held to such strict standards. Even down to the people working on the code, there is a lower bar to become a contrib commiter than a full core committer (see my contrib committer status ). Its not that I don't like what you propose, but I don't buy it as very viable the way things are now. IMO we would need to do some work to make it a reality. It can be said thats the way it is, but my view of things doesnt jive with it. I may have mis written "generally useful". What I meant was, if the sweet spot sim is better than the default sim, but a bit harder to use because of config, perhaps it is "core" enough to go there, as often it may be better to use. Again, I fully believe it would get more attention and be 'better' maintained. I did not mean to set the bar at "generally useful" and I apologize for my imprecise language (one of my many faults). I think that's the wrong question to ask. I would rather ask the question "Is X decoupled enough from Lucene internals that it can be a contrib?" 
Things like IndexWriter, IndexReader, Document and TokenStream really need to be "core" ... but things like the QueryParser, and most of our analyzers, don't. Having lots of loosely coupled mini-libraries that respect good API boundaries seems more reusable and generally saner than "all of this code is useful and lots of people want it, so throw it into the kitchen sink". We don't need to go hog wild gutting things out of the core ... but I don't think we should be adding new things to the core just because they are "generally useful". -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
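As a concrete point of reference for this thread, opting into SweetSpotSimilarity today only takes a few lines of application code. The sketch below is mine, not from the discussion; it targets the 2.3/2.4-era API (some of these calls are deprecated in later releases), and the index path, field name and sample text are made up. SweetSpotSimilarity also exposes tuning methods for the length-norm plateau that are omitted here; check the contrib javadocs before relying on it.

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.misc.SweetSpotSimilarity; // lives in contrib/miscellaneous
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SweetSpotDemo {
  public static void main(String[] args) throws Exception {
    // The same Similarity must be used at index time and at search time.
    SweetSpotSimilarity sim = new SweetSpotSimilarity();

    IndexWriter writer = new IndexWriter("/tmp/sss-demo", new StandardAnalyzer(), true);
    writer.setSimilarity(sim);
    Document doc = new Document();
    doc.add(new Field("body", "a short example document", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher("/tmp/sss-demo");
    searcher.setSimilarity(sim);
    TopDocs td = searcher.search(new TermQuery(new Term("body", "example")), null, 10);
    System.out.println("hits: " + td.totalHits);
    searcher.close();
  }
}
{code}

The point relevant to the defaults discussion is simply that none of this wiring is needed for whatever Similarity ships as the default: DefaultSimilarity is picked up automatically when nothing is set.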
solr2: Onward and Upward
If you've considered Solr in the past, but for some reason it didn't meet your needs, we'd love to hear from you over on solr-dev. We're starting to do some forward looking architecture work on the next major version of Solr, so let us know what ideas you have and what you'd like to see! solr-dev thread: http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html#a19224805 -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Realtime Search for Social Networks Collaboration
Hello all, I don't mean this to sound like a solicitation. I've been working on realtime search and created some Lucene patches etc. I am wondering if there are social networks (or anyone else) out there who would be interested in collaborating with Apache on realtime search to get it to the point it can be used in production. It is a challenging problem that only Google has solved and made to scale. I've been working on the problem for a while and though a lot has been completed, there is still a lot more to do, and collaboration amongst the most probable users (social networks) seems like a good thing to try at this point. I guess I'm saying it seems like a hard enough problem that perhaps it's best to work together on it rather than each company trying to complete its own. However, I could be wrong. Realtime search benefits social networks by providing a scalable, searchable alternative to large MySQL implementations. MySQL, I have heard, is difficult to scale past a certain point. Apparently Google has created things like BigTable (a large database) and an online service called GData (Google has not published any whitepapers on the underlying technology) to address scaling large database systems. BigTable does not offer search. GData does, and is used by all of Google's web services instead of something like MySQL (this is at least how I understand it). Social networks usually grow, and so scaling is continually an issue. It is possible to build a realtime search system that scales linearly, something that I have heard becomes difficult with MySQL. There is an article that discusses some of these issues: http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337 I don't think the current GData implementation is perfect and there is a lot that can be improved on. It might be helpful to figure out together what helpful things can be added. If this sounds like something of interest to anyone, feel free to send your input. Take care, Jason - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar
[ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628106#action_12628106 ] Steven Rowe commented on LUCENE-1126: - Yeah, I see this too. The issue is that the entire Thai range {{\u0e00-\u0e5b}} is included in the unpatched grammar's {LETTER} definition, which contains the huge range {{\u0100-\u1fff}}, much of which is not actually letters. The patched grammar instead substitutes the Unicode 3.0 {{Letter}} general category (via JFlex's [:letter:]), which excludes some characters in the Thai range: non-spacing marks, a currency symbol, numerals, etc. ThaiAnalyzer uses ThaiWordFilter, which uses Java's BreakIterator to tokenize the contiguous text (i.e. without whitespace) provided by StandardTokenizer. The failing test expects to see {{"\u0e17\u0e35\u0e48"}}, but instead gets {{"\u0e17"}}, because {{\u0e35}} is a non-spacing mark, which the patched StandardTokenizer doesn't pass to ThaiWordFilter. Because of this problem, I guess I'm -1 on applying the patch I provided. One solution would be to switch from using the {{Letter}} general category to the derived property {{Alphabetic}}, which includes both general categories {{Letter}} and {{Mark}}. (see Annex C of [the Unicode Regular Expressions Technical Standard|http://www.unicode.org/unicode/reports/tr18/#Compatibility_Properties] under "alpha" for discussion of this). The current version of JFlex does not support Unicode property references in its syntax, though, so simplifying -- and correcting -- the grammar may have to wait for the next version of JFlex, which will support syntax like {{\p{Alphabetic}}}. > Simplify StandardTokenizer JFlex grammar > > > Key: LUCENE-1126 > URL: https://issues.apache.org/jira/browse/LUCENE-1126 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.2 >Reporter: Steven Rowe >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1126.patch > > > Summary of thread entitled "Fullwidth alphanumeric characters, plus a > question on Korean ranges" begun by Daniel Noll on java-user, and carried > over to java-dev: > On 01/07/2008 at 5:06 PM, Daniel Noll wrote: > > I wish the tokeniser could just use Character.isLetter and > > Character.isDigit instead of having to know all the ranges itself, since > > the JRE already has all this information. Character.isLetter does > > return true for CJK characters though, so the ranges would still come in > > handy for determining what kind of letter they are. I don't support > > JFlex has a way to do this... > The DIGIT macro could be replaced by JFlex's predefined character class > [:digit:], which has the same semantics as java.lang.Character.isDigit(). > Although JFlex's predefined character class [:letter:] (same semantics as > java.lang.Character.isLetter()) includes CJK characters, there is a way to > handle this using JFlex's regex negation syntax {{!}}. From [the JFlex > documentation|http://jflex.de/manual.html]: > bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is > !(!{{a}}|{{b}}) > So to exclude CJ characters from the LETTER macro: > {code} > LETTER = ! ( ! [:letter:] | {CJ} ) > {code} > > Since [:letter:] includes all of the Korean ranges, there's no reason > (AFAICT) to treat them separately; unlike Chinese and Japanese characters, > which are individually tokenized, the Korean characters should participate in > the same token boundary rules as all of the other letters. 
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 > supports, and Unicode 5.0, the latest version, and there are lots of new and > modified letter and digit ranges. This stuff gets tweaked all the time, and > I don't think Lucene should be in the business of trying to track it, or take > a position on which Unicode version users' data should conform to. > Switching to using JFlex's [:letter:] and [:digit:] predefined character > classes ties (most of) these decisions to the user's choice of JVM version, > and this seems much more reasonable to me than the current status quo. > I will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
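To make the failure Steven describes concrete: JFlex's [:letter:] has the semantics of Character.isLetter(), which rejects combining marks such as U+0E35. The small check below is my own illustration (not part of the patch or the test suite) and uses only the JDK:

{code:java}
public class ThaiCharCheck {
  public static void main(String[] args) {
    char saraIi = '\u0e35'; // THAI CHARACTER SARA II, a combining vowel sign
    // [:letter:] follows Character.isLetter(), which excludes marks:
    System.out.println(Character.isLetter(saraIi)); // false -> dropped by the patched grammar
    // The character's general category is Mn (non-spacing mark):
    System.out.println(Character.getType(saraIi) == Character.NON_SPACING_MARK); // true
  }
}
{code}

A grammar based on the {{Alphabetic}} derived property would be expected to keep this character, which is why the simplification probably has to wait for {{\p{Alphabetic}}} support in the next JFlex release, as noted above.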
[jira] Reopened: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reopened LUCENE-1320: - Lucene Fields: [Patch Available] (was: [Patch Available, New]) Despite the fact that we allow contribs to be 1.5, I don't think the analysis package should be 1.5, at least it shouldn't be made 1.5 without some discussion on the mailing list. > ShingleMatrixFilter, a three dimensional permutating shingle filter > --- > > Key: LUCENE-1320 > URL: https://issues.apache.org/jira/browse/LUCENE-1320 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.4 > > Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt > > > Backed by a column focused matrix that creates all permutations of shingle > tokens in three dimensions. I.e. it handles multi token synonyms. > Could for instance in some cases be used to replaces 0-slop phrase queries > with something speedier. > {code:java} > Token[][][]{ > {{hello}, {greetings, and, salutations}}, > {{world}, {earth}, {tellus}} > } > {code} > passes the following test with 2-3 grams: > {code:java} > assertNext(ts, "hello_world"); > assertNext(ts, "greetings_and"); > assertNext(ts, "greetings_and_salutations"); > assertNext(ts, "and_salutations"); > assertNext(ts, "and_salutations_world"); > assertNext(ts, "salutations_world"); > assertNext(ts, "hello_earth"); > assertNext(ts, "and_salutations_earth"); > assertNext(ts, "salutations_earth"); > assertNext(ts, "hello_tellus"); > assertNext(ts, "and_salutations_tellus"); > assertNext(ts, "salutations_tellus"); > {code} > Contains more and less complex tests that demonstrate offsets, posincr, > payload boosts calculation and construction of a matrix from a token stream. > The matrix attempts to hog as little memory as possible by seeking no more > than maximumShingleSize columns forward in the stream and clearing up unused > resources (columns and unique token sets). Can still be optimized quite a bit > though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1320: Priority: Blocker (was: Major) I'm marking this as a blocker for 2.4 based on the Java 1.5 incompatibilities that were introduced. > ShingleMatrixFilter, a three dimensional permutating shingle filter > --- > > Key: LUCENE-1320 > URL: https://issues.apache.org/jira/browse/LUCENE-1320 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Blocker > Fix For: 2.4 > > Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt > > > Backed by a column focused matrix that creates all permutations of shingle > tokens in three dimensions. I.e. it handles multi token synonyms. > Could for instance in some cases be used to replaces 0-slop phrase queries > with something speedier. > {code:java} > Token[][][]{ > {{hello}, {greetings, and, salutations}}, > {{world}, {earth}, {tellus}} > } > {code} > passes the following test with 2-3 grams: > {code:java} > assertNext(ts, "hello_world"); > assertNext(ts, "greetings_and"); > assertNext(ts, "greetings_and_salutations"); > assertNext(ts, "and_salutations"); > assertNext(ts, "and_salutations_world"); > assertNext(ts, "salutations_world"); > assertNext(ts, "hello_earth"); > assertNext(ts, "and_salutations_earth"); > assertNext(ts, "salutations_earth"); > assertNext(ts, "hello_tellus"); > assertNext(ts, "and_salutations_tellus"); > assertNext(ts, "salutations_tellus"); > {code} > Contains more and less complex tests that demonstrate offsets, posincr, > payload boosts calculation and construction of a matrix from a token stream. > The matrix attempts to hog as little memory as possible by seeking no more > than maximumShingleSize columns forward in the stream and clearing up unused > resources (columns and unique token sets). Can still be optimized quite a bit > though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
>>Another important driver is the "out-of-the-box experience". >>we need a "standard distro" ...which would be the core plus cherry-pick certain important contrib modules (highlighter, >> SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them together. Is that not Solr, or at least the start of a path that ultimately ends up there? I suspect any attempts at "bundling" Lucene code may snowball until you've rebuilt Solr. If anything, I suspect a more interesting initiative might be to "unbundle" Solr and see some more of its features emerge as standalone modules in Lucene/contrib (or a suitably renamed area e.g. "extensions")? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Realtime Search for Social Networks Collaboration
On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > I am wondering > if there are social networks (or anyone else) out there who would be > interested in collaborating with Apache on realtime search to get it > to the point it can be used in production. Good timing Jason, I think you'll find some other people right here at Apache (solr-dev) that want to collaborate in this area: http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html I've looked at your wiki briefly, and all the high level goals/features seem to really be synergistic with where we are going with Solr2. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
markharw00d wrote: >>Another important driver is the "out-of-the-box experience". >>we need a "standard distro" ...which would be the core plus cherry-pick certain important contrib modules (highlighter, >> SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them together. Is that not Solr, or at least the start of a path that ultimately ends up there? I suspect any attempts at "bundling" Lucene code may snowball until you've rebuilt Solr. Yeah, I guess it is... though Solr includes the whole webapp too, whereas I think there's a natural bundle that wouldn't include that. Still, I think it's important for Lucene itself to have strong defaults out of the box. If anything, I suspect a more interesting initiative might be to "unbundle" Solr and see some more of its features emerge as standalone modules in Lucene/contrib (or a suitably renamed area e.g. "extensions")? I like that! Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
On Wed, Sep 3, 2008 at 4:55 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: >> I suspect any attempts at "bundling" Lucene code may snowball until you've >> rebuilt Solr. > > Yeah I guess it is... though Solr includes the whole webapp too, whereas I > think there's a natural bundle that wouldn't include that. One thing we are looking at for Solr2 is making it more useful for advanced embedded users. I expect a non-webapp version too. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote: Obviously we can't default everything perfectly since at some point there are hard tradeoffs to be made and every app is different, but if SweetSpotSimilarity really gives better relevance for many/most apps, and doesn't have any downsides (I haven't looked closely myself), I think we should get it into core? Well, we only have 2 data points here: Hoss' original position that it was helpful, and Doron's Million Query work. Has anyone else reported benefit? And in that regard, the difference between OOTB and SweetSpot was 0.154 vs. 0.162 for MAP. Not a huge amount, but still useful. Along those lines, there are other length normalization functions (namely approaches that don't favor very short documents as much) that I've seen benefit applications as well, but, as Erik is (in)famous for saying, "it depends". In fact, if we go solely based on the million query work, we'd be better off having the Query Parser create phrase queries automatically for any query w/ more than 1 term (0.19 vs 0.154) before we even touch length normalization. I've long argued that Lucene needs to take on the relevance question more head-on, and in an open source way; until then, we are merely guessing at what's better, w/o empirical evidence that can be easily reproduced. TREC is just one data point, and is often discounted as not being all that useful in the real world. I'm on the fence, though. I agree w/ Hoss that core should be "core" and I don't think we want to throw more and more into core, but I also agree w/ Mike in that we want good, intelligent defaults for what we do have in core. -Grant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628132#action_12628132 ] Karl Wettin commented on LUCENE-1320: - OK. Either remove it or place it in some alternative contrib module? The first choice is obviously the easiest. > ShingleMatrixFilter, a three dimensional permutating shingle filter > --- > > Key: LUCENE-1320 > URL: https://issues.apache.org/jira/browse/LUCENE-1320 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Blocker > Fix For: 2.4 > > Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt > > > Backed by a column focused matrix that creates all permutations of shingle > tokens in three dimensions. I.e. it handles multi token synonyms. > Could for instance in some cases be used to replaces 0-slop phrase queries > with something speedier. > {code:java} > Token[][][]{ > {{hello}, {greetings, and, salutations}}, > {{world}, {earth}, {tellus}} > } > {code} > passes the following test with 2-3 grams: > {code:java} > assertNext(ts, "hello_world"); > assertNext(ts, "greetings_and"); > assertNext(ts, "greetings_and_salutations"); > assertNext(ts, "and_salutations"); > assertNext(ts, "and_salutations_world"); > assertNext(ts, "salutations_world"); > assertNext(ts, "hello_earth"); > assertNext(ts, "and_salutations_earth"); > assertNext(ts, "salutations_earth"); > assertNext(ts, "hello_tellus"); > assertNext(ts, "and_salutations_tellus"); > assertNext(ts, "salutations_tellus"); > {code} > Contains more and less complex tests that demonstrate offsets, posincr, > payload boosts calculation and construction of a matrix from a token stream. > The matrix attempts to hog as little memory as possible by seeking no more > than maximumShingleSize columns forward in the stream and clearing up unused > resources (columns and unique token sets). Can still be optimized quite a bit > though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628154#action_12628154 ] Michael McCandless commented on LUCENE-1131: Otis is this one ready to go in? > Add numDeletedDocs to IndexReader > - > > Key: LUCENE-1131 > URL: https://issues.apache.org/jira/browse/LUCENE-1131 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Shai Erera >Assignee: Otis Gospodnetic >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1131.patch > > > Add numDeletedDocs to IndexReader. Basically, the implementation is as simple > as doing: > public int numDeletedDocs() { > return deletedDocs == null ? 0 : deletedDocs.count(); > } > in SegmentReader. > Patch to follow to include in all IndexReader extensions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token
[ https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1350. Resolution: Duplicate Fix Version/s: (was: 2.3.3) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Isn't this one now a dup of LUCENE-1333? > Filters which are "consumers" should not reset the payload or flags and > should better reuse the token > - > > Key: LUCENE-1350 > URL: https://issues.apache.org/jira/browse/LUCENE-1350 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis, contrib/* >Reporter: Doron Cohen >Assignee: Doron Cohen > Fix For: 2.4 > > Attachments: LUCENE-1350-test.patch, LUCENE-1350.patch, > LUCENE-1350.patch > > > Passing tokens with payloads through SnowballFilter results in tokens with no > payloads. > A workaround for this is to apply stemming first and only then run whatever > logic creates the payload, but this is not always convenient. > Other "consumer" filters have similar problem. > These filters can - and should - reuse the token, by implementing > next(Token), effectively also fixing the unwanted resetting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token
[ https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628158#action_12628158 ] Doron Cohen commented on LUCENE-1350: - Yes it is a dup, thanks Mike for taking care of this (I planned to do this yesterday but didn't make it) > Filters which are "consumers" should not reset the payload or flags and > should better reuse the token > - > > Key: LUCENE-1350 > URL: https://issues.apache.org/jira/browse/LUCENE-1350 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis, contrib/* >Reporter: Doron Cohen >Assignee: Doron Cohen > Fix For: 2.4 > > Attachments: LUCENE-1350-test.patch, LUCENE-1350.patch, > LUCENE-1350.patch > > > Passing tokens with payloads through SnowballFilter results in tokens with no > payloads. > A workaround for this is to apply stemming first and only then run whatever > logic creates the payload, but this is not always convenient. > Other "consumer" filters have similar problem. > These filters can - and should - reuse the token, by implementing > next(Token), effectively also fixing the unwanted resetting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1356) Allow easy extensions of TopDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628163#action_12628163 ] Michael McCandless commented on LUCENE-1356: Doron is this one ready to go in? > Allow easy extensions of TopDocCollector > > > Key: LUCENE-1356 > URL: https://issues.apache.org/jira/browse/LUCENE-1356 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3.3, 2.4 > > Attachments: 1356-2.patch, 1356.patch > > > TopDocCollector's members and constructor are declared either private or > package visible. It makes it hard to extend it as if you want to extend it > you can reuse its *hq* and *totatlHits* members, but need to define your own. > It also forces you to override getTotalHits() and topDocs(). > By changing its members and constructor (the one that accepts a PQ) to > protected, we allow users to extend it in order to get a different view of > 'top docs' (like TopFieldCollector does), but still enjoy its getTotalHits() > and topDocs() method implementations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
My thought was to move SSS to core as a step towards making it the default, if and when there is more evidence that it is better than the current default - it just felt right as a cautious step - I mean, first move it to core so that it is more exposed and used, and only after a while, maybe, if the evidence is mostly positive, make it the default. On Thu, Sep 4, 2008 at 12:04 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote: > >> >> Obviously we can't default everything perfectly since at some point >> there are hard tradeoffs to be made and every app is different, but if >> SweetSpotSimilarity really gives better relevance for many/most apps, >> and doesn't have any downsides (I haven't looked closely myself), I >> think we should get it into core? >> > > Well, we only have 2 data points here: Hoss' original position that it was > helpful, and Doron's Million Query work. Has anyone else reported benefit? > And in that regard, the difference between OOTB and SweetSpot was 0.154 vs. > 0.162 for MAP. Not a huge amount, but still useful. In that regard, there > are other length normalization functions (namely approaches that don't favor > very short documents as much) that I've seen benefit applications as well, > but as Erik is (in)famous for saying "it depends". In fact, if we go solely > based on the million query work, we'd be better off having the Query Parser > create phrase queries automatically for any query w/ more than 1 term (0.19 > vs 0.154) before we even touch length normalization. > > I've long argued that Lucene needs to take on the relevance question more > head on, and in an open source way, until then, we are merely guessing at > what's better, w/o empirical evidence that can be easily reproduced. TREC > is just one data point, and is often discounted as being all that useful in > the real world. > > I'm on the fence, though. I agree w/ Hoss that core should be "core" and I > don't think we want to throw more and more into core, but I also agree w/ > Mike in that we want good, intelligent defaults for what we do have in core. > > -Grant > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Realtime Search for Social Networks Collaboration
Hi Yonik, The SOLR 2 list looks good. The question is, who is going to do the work? I tried to simplify the scope of Ocean as much as possible to make it possible (and slowly at that over time) for me to eventually finish what is mentioned on the wiki. I think SOLR is very cool and was a major step forward when it came out. I also think it's got a lot of things now which make integration difficult to do properly. I did try to integrate and received a lukewarm response, and so decided to just move ahead separately until folks have time to collaborate. We probably should try to integrate SOLR and Ocean somehow; however, we may want to simply reduce the scope a bit and figure out what is needed most, with the main use case being social networks. I think the problem with integrating with SOLR is that it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. Facets were important; realtime was not needed because pricing doesn't change very often. I designed Ocean for social networks and, further into the future, realtime messaging-based mobile applications. SOLR needs to be backward compatible and support its existing user base. How do you plan on doing this for SOLR 2 if the architecture is changed dramatically? SOLR solves a problem set that is very common, making it very useful in many situations. However, I wanted Ocean to be like GData. So I wanted the scalability of Google, which SOLR doesn't quite have yet, and the realtime part, and then I figured the other stuff could be added later - stuff people seem to spend a lot of time on in the SOLR community currently (spellchecker, db imports, many others). I did use some of the SOLR terminology in building Ocean, like snapshots! But most of it is a digression. I tried to use schemas, but they just make the system harder to use. For distributed search I prefer serialized objects, as this enables things like SpanQueries and payloads without writing request handlers and such. Also, there is no need to write new request handlers and deploy them (an expensive operation for systems that run on hundreds of servers), as any new classes are simply dynamically loaded by the server from the client. A lot is now outlined on the wiki at http://wiki.apache.org/lucene-java/OceanRealtimeSearch and there will be a lot more javadocs in the forthcoming patch. The latest code is also available all the time at http://oceansearch.googlecode.com/svn/trunk/trunk/oceanlucene and I do welcome more discussion; if there are Solr developers who wish to work on Ocean, feel free to drop me a line. Most of all, though, I think it would be useful for social networks interested in realtime search to get involved, as it may be something that is difficult for one company to implement to a production level on its own. I think this is where open source collaboration is particularly useful. Cheers, Jason Rutherglen [EMAIL PROTECTED] On Wed, Sep 3, 2008 at 4:56 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen > <[EMAIL PROTECTED]> wrote: >> I am wondering >> if there are social networks (or anyone else) out there who would be >> interested in collaborating with Apache on realtime search to get it >> to the point it can be used in production. 
> > Good timing Jason, I think you'll find some other people right here > at Apache (solr-dev) that want to collaborate in this area: > > http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html > > I've looked at your wiki briefly, and all the high level goals/features seem > to really be synergistic with where we are going with Solr2. > > -Yonik > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1356) Allow easy extensions of TopDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628193#action_12628193 ] Doron Cohen commented on LUCENE-1356: - It is, applies cleanly and seems correct. Will commit as soon as tests complete. > Allow easy extensions of TopDocCollector > > > Key: LUCENE-1356 > URL: https://issues.apache.org/jira/browse/LUCENE-1356 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3.3, 2.4 > > Attachments: 1356-2.patch, 1356.patch > > > TopDocCollector's members and constructor are declared either private or > package visible. It makes it hard to extend it as if you want to extend it > you can reuse its *hq* and *totatlHits* members, but need to define your own. > It also forces you to override getTotalHits() and topDocs(). > By changing its members and constructor (the one that accepts a PQ) to > protected, we allow users to extend it in order to get a different view of > 'top docs' (like TopFieldCollector does), but still enjoy its getTotalHits() > and topDocs() method implementations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1356) Allow easy extensions of TopDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-1356. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Thanks Shai ! > Allow easy extensions of TopDocCollector > > > Key: LUCENE-1356 > URL: https://issues.apache.org/jira/browse/LUCENE-1356 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3.3, 2.4 > > Attachments: 1356-2.patch, 1356.patch > > > TopDocCollector's members and constructor are declared either private or > package visible. It makes it hard to extend it as if you want to extend it > you can reuse its *hq* and *totatlHits* members, but need to define your own. > It also forces you to override getTotalHits() and topDocs(). > By changing its members and constructor (the one that accepts a PQ) to > protected, we allow users to extend it in order to get a different view of > 'top docs' (like TopFieldCollector does), but still enjoy its getTotalHits() > and topDocs() method implementations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-989) Statistics from ValueSourceQuery
[ https://issues.apache.org/jira/browse/LUCENE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-989: --- Fix Version/s: (was: 2.4) 3.0 Assignee: (was: Doron Cohen) This should be looked at with LUCENE-1085 - removing myself so as not to block others who can do it sooner. > Statistics from ValueSourceQuery > - > > Key: LUCENE-989 > URL: https://issues.apache.org/jira/browse/LUCENE-989 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.2 >Reporter: Will Johnson >Priority: Minor > Fix For: 3.0 > > Attachments: functionStats.patch > > > Patch forthcoming that adds a DocValuesStats object that is optionally > computed for a ValueSourceQuery. This ~replaces the getMin/Max/Avg from the > DocValues which were previously inaccessible without reasonably heavy > subclassing. In addition it adds a few more stats and provides a single > object to encapsulate all statistics going forward. The stats object is tied > to the ValueSourceQuery so that the values can be cached without having to > maintain the full set of DocValues. Test and javadocs included. > - will -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1081) Remove the "Experimental" warnings from search.function package
[ https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1081: Fix Version/s: (was: 2.4) 3.0 Assignee: (was: Doron Cohen) Will depend on LUCENE-1085. > Remove the "Experimental" warnings from search.function package > --- > > Key: LUCENE-1081 > URL: https://issues.apache.org/jira/browse/LUCENE-1081 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4 >Reporter: Doron Cohen >Priority: Minor > Fix For: 3.0 > > > I am using this package for a while, seems that others in this list use it > too, so let's remove those warnings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1085) search.function should support all capabilities of Solr's search.function
[ https://issues.apache.org/jira/browse/LUCENE-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1085: Fix Version/s: (was: 2.4) 3.0 Assignee: (was: Doron Cohen) > search.function should support all capabilities of Solr's search.function > - > > Key: LUCENE-1085 > URL: https://issues.apache.org/jira/browse/LUCENE-1085 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Doron Cohen >Priority: Minor > Fix For: 3.0 > > > Lucene search.function does not allow Solr to move to use it, and so Solr > currently maintains its own version of this package. > Enhance Lucene's search.function so that Solr can move to use it, and avoid > this redundancy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Issue Comment Edited: (LUCENE-989) Statistics from ValueSourceQuery
[ https://issues.apache.org/jira/browse/LUCENE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628197#action_12628197 ] doronc edited comment on LUCENE-989 at 9/3/08 4:38 PM: This should be looked at with LUCENE-1085 - removing myself so as not to block others who can do it sooner. was (Author: doronc): This should be look at with LUCENE-1085 - removing myself to not so others who can do it sooner. > Statistics from ValueSourceQuery > - > > Key: LUCENE-989 > URL: https://issues.apache.org/jira/browse/LUCENE-989 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.2 >Reporter: Will Johnson >Priority: Minor > Fix For: 3.0 > > Attachments: functionStats.patch > > > Patch forthcoming that adds a DocValuesStats object that is optionally > computed for a ValueSourceQuery. This ~replaces the getMin/Max/Avg from the > DocValues which were previously inaccessible without reasonably heavy > subclassing. In addition it adds a few more stats and provides a single > object to encapsulate all statistics going forward. The stats object is tied > to the ValueSourceQuery so that the values can be cached without having to > maintain the full set of DocValues. Test and javadocs included. > - will -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
Grant Ingersoll (JIRA) wrote: Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one. This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe. Grant, not sure what you mean by "b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one." Once we set replaceInvalidAcronym=true, then the type is set to HOST. However, if you were to revisit the grammar, then I would be interested to get in on the discussion on the behaviour of HOST tokens. For instance, if you have a document like "visit www.apache.org", you currently won't get a hit if you search for "apache". In an issue tracker like JIRA, we want to be able to search for "NullPointerException", and get a hit for the document "Application threw java.lang.NullPointerException". Also note that the current implementation has problems if the document doesn't contain expected whitespace, e.g. "I like Apache.They rock" will get tokenized to the following tokens: "I", "like", "Apache.They", "rock". I don't think there is a simple one-size-fits-all answer to how this should behave. It depends on the context of the app that is using Lucene. The best answer may be to make some of the behaviour configurable, or to have a suite of specific analyzers? Mark. Most of the contributed Analyzers suffer from invalid recognition of acronyms. -- Key: LUCENE-1373 URL: https://issues.apache.org/jira/browse/LUCENE-1373 Project: Lucene - Java Issue Type: Bug Components: Analysis, contrib/analyzers Affects Versions: 2.3.2 Reporter: Mark Lassau Priority: Minor LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end). Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us. StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy. Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :( I refer to: * BrazilianAnalyzer * CzechAnalyzer * DutchAnalyzer * FrenchAnalyzer * GermanAnalyzer * GreekAnalyzer * ThaiAnalyzer - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
Or just remove the generics, right? On Sep 3, 2008, at 5:09 PM, Karl Wettin (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628132 #action_12628132 ] Karl Wettin commented on LUCENE-1320: - OK. Either remove it or place it in some alternative contrib module? The first chooise is obviously the easiest. ShingleMatrixFilter, a three dimensional permutating shingle filter --- Key: LUCENE-1320 URL: https://issues.apache.org/jira/browse/LUCENE-1320 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Affects Versions: 2.3.2 Reporter: Karl Wettin Assignee: Karl Wettin Priority: Blocker Fix For: 2.4 Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt Backed by a column focused matrix that creates all permutations of shingle tokens in three dimensions. I.e. it handles multi token synonyms. Could for instance in some cases be used to replaces 0-slop phrase queries with something speedier. {code:java} Token[][][]{ {{hello}, {greetings, and, salutations}}, {{world}, {earth}, {tellus}} } {code} passes the following test with 2-3 grams: {code:java} assertNext(ts, "hello_world"); assertNext(ts, "greetings_and"); assertNext(ts, "greetings_and_salutations"); assertNext(ts, "and_salutations"); assertNext(ts, "and_salutations_world"); assertNext(ts, "salutations_world"); assertNext(ts, "hello_earth"); assertNext(ts, "and_salutations_earth"); assertNext(ts, "salutations_earth"); assertNext(ts, "hello_tellus"); assertNext(ts, "and_salutations_tellus"); assertNext(ts, "salutations_tellus"); {code} Contains more and less complex tests that demonstrate offsets, posincr, payload boosts calculation and construction of a matrix from a token stream. The matrix attempts to hog as little memory as possible by seeking no more than maximumShingleSize columns forward in the stream and clearing up unused resources (columns and unique token sets). Can still be optimized quite a bit though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
I think we should distinguish between what is a bug and what is an attempt by the tokenizer to produce a meaningful token. When the tokenizer outputs a HOST or ACRONYM token type, there's nothing that prevents you from putting a filter after the tokenizer that will use a UIMA Annotator (for example) and verify that the output token type is indeed correct. For example, in the case of java.lang.NullPointerException we all understand it's not a HOST, but unfortunately our logic hasn't been translated well into computer instructions, yet :-). However you treat this token now is up to you: - If you want to be able to search for the individual parts of the host, but still find the full host, I'd put a TokenFilter after the tokenizer that breaks the HOST into its parts and returns the parts along with the full host name (a rough sketch of such a filter follows this message). During query time I'd then remove that filter (i.e. create an Analyzer w/o that filter) and thus I'd be able to search for either "apache" or "www.apache.org". - If you want to actually verify the output HOST is indeed a host, again, put a TokenFilter after the tokenizer and either apply your own simple heuristics (for example, if there's a ".com", ".org", ".net" it's a HOST, otherwise it's not - I know these don't cover all HOST types, it's just an example), or validate that with an external tool, like a UIMA Annotator. - You can also decide that a two-part HOST is not really a host; that way you solve the "I like Apache.They rock" problem, but miss a whole handful of hosts like "ibm.com", "apache.org", "google.com". Again, IMO, the logic in the tokenizer today for HOSTs and ACRONYMs is a "best effort" to produce a meaningful token. If we remove those rules, it'd be impossible to detect them, because the tokenizer is set to discard any stand-alone "&", ".", or "@". I'm going to send out another email to the list about a bug or inconsistency I recently found in the COMPANY rule. I don't want to mix this thread with a different issue. On Thu, Sep 4, 2008 at 5:17 AM, Mark Lassau <[EMAIL PROTECTED]> wrote: > Grant Ingersoll (JIRA) wrote: > >> Of course, it's still a bit weird, b/c in your case the type value is >> going to be set to ACRONYM, when your example is clearly not one. This >> suggests to me that the grammar needs to be revisited, but that can wait >> until 3.0 I believe. >> >> >> > Grant, not sure what you mean by "b/c in your case the type value is going > to be set to ACRONYM, when your example is clearly not one." > Once we set replaceInvalidAcronym=true, then the type is set to HOST. > > However, if you were to revisit the grammar, then I would be interested to > get in on the discussion on the behaviour of HOST tokens. > For instance, if you have a document like "visit www.apache.org", you > currently won't get a hit if you search for "apache". > In an issue tracker like JIRA, we want to be able to search for > "NullPointerException", and get a hit for the document "Application threw > java.lang.NullPointerException". > > Also note that the current implementation has problems if the document > doesn't contain expected whitespace. > eg "I like Apache.They rock" > Will get tokenized to the following: > I > like > Apache.They > rock > > I don't think there is a simple one-size-fits-all answer to how this should > behave. It depends on the context of the app that is using Lucene. > The best answer may be to make some of the behaviour configurable, or have > a suite of specific analyzers? > > Mark. 
> >> Most of the contributed Analyzers suffer from invalid recognition of >>> acronyms. >>> >>> -- >>> >>>Key: LUCENE-1373 >>>URL: https://issues.apache.org/jira/browse/LUCENE-1373 >>>Project: Lucene - Java >>> Issue Type: Bug >>> Components: Analysis, contrib/analyzers >>> Affects Versions: 2.3.2 >>> Reporter: Mark Lassau >>> Priority: Minor >>> >>> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like " >>> www.apache.org." would be incorrectly tokenized as an acronym (note the >>> dot at the end). >>> Unfortunately, keeping the "backward compatibility" of a bug turns out to >>> harm us. >>> StandardTokenizer has a couple of ways to indicate "fix this bug", but >>> unfortunately the default behaviour is still to be buggy. >>> Most of the non-English analyzers provided in lucene-analyzers utilize >>> the StandardTokenizer, and in v2.3.2 not one of these provides a way to get >>> the non-buggy behaviour :( >>> I refer to: >>> * BrazilianAnalyzer >>> * CzechAnalyzer >>> * DutchAnalyzer >>> * FrenchAnalyzer >>> * GermanAnalyzer >>> * GreekAnalyzer >>> * ThaiAnalyzer >>> >>> >> >> >> > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PR
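As referenced in the first option above, here is a rough sketch of such a part-splitting filter. It is not an existing Lucene class: the name is made up, it uses the old String-based Token API of the 2.3/2.4 line for brevity, and a real implementation would want to be more careful about offsets, position increments and type strings. The same idea applied to dotted class names would let "NullPointerException" match "java.lang.NullPointerException".

{code:java}
import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Re-emits each <HOST> token followed by its dot-separated parts (hypothetical example). */
public class HostPartsFilter extends TokenFilter {
  private final LinkedList pending = new LinkedList();

  public HostPartsFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (!pending.isEmpty()) {
      return (Token) pending.removeFirst();
    }
    Token token = input.next();
    if (token == null || !"<HOST>".equals(token.type())) {
      return token; // pass everything else through untouched
    }
    String text = token.termText();
    int start = token.startOffset();
    String[] parts = text.split("\\.");
    for (int i = 0; i < parts.length; i++) {
      Token part = new Token(parts[i], start, start + parts[i].length());
      part.setPositionIncrement(0); // stack the parts on the same position as the host
      pending.add(part);
      start += parts[i].length() + 1; // skip past the '.'
    }
    return token; // emit the full host first, then its parts
  }
}
{code}

At query time you would typically analyze without this filter (or keep it, if queries on "www.apache.org" should also expand), which mirrors the "remove that filter at query time" suggestion above.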
Is the COMPANY rule in StandardTokenizer valid?
Hi. The COMPANY rule in StandardTokenizer is defined like this: // Company names like AT&T and [EMAIL PROTECTED] COMPANY= {ALPHA} ("&"|"@") {ALPHA} While this works perfectly for AT&T and [EMAIL PROTECTED], it doesn't work well for strings like widget&javascript&html. Now, the latter is obviously wrongly typed, and should have been separated by spaces, but that's what a user typed in a document, and now we need to treat it right (why don't they understand the rules of IR and tokenization?). Normally I wouldn't care and say this is one of the extreme cases, but unfortunately the tokenizer outputs two tokens: widget&javascript and html. Now that bothers me - the user can search for "html" and find the document, but not "javascript" or "widget", which is a bit harder to explain to users, even the intelligent ones. That got me thinking about whether this rule is properly defined, and what its purpose is. Obviously it's an attempt to avoid breaking legitimate company names on "&" and "@", but I'm not sure it covers all company name formats. For example, AT&T can be written as "AT & T" (with spaces) and I've also seen cases where it's written as ATT. While you could say "it's a best effort case", users don't buy that. Either you do something properly (it doesn't have to be 100% accurate, though), or you don't do it at all (I hope that doesn't sound too harsh). That way it's easy to explain to your users that you simply break on "&" or "@" (unless it's an email). They may not like it, but you'll at least be consistent. This rule slows down StandardTokenizer's tokenization, and in the end does not produce consistent results. If we think it's important to detect these tokens, then let's at least make it consistent by either: - changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing both "AT&T" and "widget&javascript&html" as COMPANY. That at least will allow developers to put a CompanyTokenFilter (for example) after the tokenizer to break on "&" and "@" whenever there are more than two parts. We could also modify StandardFilter (which already handles ACRONYM) to handle COMPANY that way. - changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so that we recognize company names only if the pattern is followed by a space, dot, dash, underscore, exclamation mark or question mark. That'll still recognize AT&T, but won't recognize widget&javascript&html as COMPANY (which is good). What do you think? Shai
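For anyone who wants to reproduce the behaviour described above before weighing in, a quick check along these lines prints each token with its type. This is my own sketch (again using the era's deprecated String-based Token API); per Shai's report it should show "widget&javascript" with type <COMPANY> and "html" with type <ALPHANUM>:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CompanyRuleCheck {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    TokenStream ts = analyzer.tokenStream("f", new StringReader("widget&javascript&html"));
    // Print term text and token type for each token produced by StandardAnalyzer.
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println(t.termText() + "\t" + t.type());
    }
  }
}
{code}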