[jira] [Commented] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909145#comment-16909145 ]

Itamar Syn-Hershko commented on LUCENE-8565:

Heya - is this waiting on anything in particular that I can help finalize? I would really like to see this merged in. Thanks

> SimpleQueryParser to support field filtering (aka Add field:text operator)
> --
>
> Key: LUCENE-8565
> URL: https://issues.apache.org/jira/browse/LUCENE-8565
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser
> Reporter: Itamar Syn-Hershko
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> SimpleQueryParser lacks support for the `field:` operator for creating
> queries which operate on fields other than the default field. It seems one
> can either have the parsed query operate on a single field, or on ALL
> defined fields (+ weights); there is no support for specifying `field:value`
> in the query itself.
>
> It probably wasn't forgotten, but rather left out for simplicity. Since
> we are using this QP implementation more and more (mostly through
> Elasticsearch), we thought it would be useful to have it in.
>
> This doesn't seem too hard to pull off, and I'll be happy to contribute a
> patch for it.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)

---
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771783#comment-16771783 ]

Itamar Syn-Hershko commented on LUCENE-8565:

I'm not sure what the Lucene versioning policy on that would be, but we can always change the default flag to turn off field filtering support.
Re: SimpleQueryParser to support field filtering?
Anyone?

--
Itamar Syn-Hershko
CTO, Founder
BigData Boutique <http://bigdataboutique.com/>
Elasticsearch Consulting Partner
Microsoft MVP | Lucene.NET PMC
http://code972.com | @synhershko <https://twitter.com/synhershko>

On Mon, Jan 14, 2019 at 10:19 AM Itamar Syn-Hershko wrote:
> Hi all,
>
> I sent a PR back in November to resolve the title and would appreciate
> feedback.
>
> JIRA: https://issues.apache.org/jira/browse/LUCENE-8565
>
> PR: https://github.com/apache/lucene-solr/pull/498
>
> What do people think?
SimpleQueryParser to support field filtering?
Hi all,

I sent a PR back in November to resolve the title and would appreciate feedback.

Summary: SimpleQueryParser lacks support for the `field:` operator for creating queries which operate on fields other than the default field. It seems one can either have the parsed query operate on a single field, or on ALL defined fields (+ weights); there is no support for specifying `field:value` in the query itself.

It probably wasn't forgotten, but rather left out for simplicity. Since we are using this QP implementation more and more (mostly through Elasticsearch), we thought it would be useful to have it in.

JIRA: https://issues.apache.org/jira/browse/LUCENE-8565

PR: https://github.com/apache/lucene-solr/pull/498

What do people think?

Cheers,

--
Itamar Syn-Hershko
CTO, Founder
BigData Boutique <http://bigdataboutique.com/>
Elasticsearch Consulting Partner
http://code972.com | @synhershko <https://twitter.com/synhershko>
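For readers unfamiliar with the proposal, the intent of the `field:` operator can be illustrated with a small standalone sketch. This is plain Java with no Lucene dependency; the class name, method name, and the "body" default-field name are hypothetical and are not taken from the PR - the actual change hooks into SimpleQueryParser's parsing state machine:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

// Illustration only: splits a single clause of the form "field:text" into a
// (field, text) pair, falling back to the default field when there is no
// usable "field:" prefix. This mirrors the proposed query syntax, not
// SimpleQueryParser's internals.
public class FieldClauseDemo {
    static final String DEFAULT_FIELD = "body"; // assumed default field name

    static Entry<String, String> split(String clause) {
        int i = clause.indexOf(':');
        if (i <= 0 || i == clause.length() - 1) {
            // no field prefix (or empty field/term) -> search the default field
            return new SimpleEntry<>(DEFAULT_FIELD, clause);
        }
        return new SimpleEntry<>(clause.substring(0, i), clause.substring(i + 1));
    }

    public static void main(String[] args) {
        System.out.println(split("title:lucene")); // title=lucene
        System.out.println(split("lucene"));       // body=lucene
    }
}
```

Today the closest one can get with the stock parser is the constructor that takes a field-to-weight map, which applies every clause to ALL of those fields rather than letting the query text pick one.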
[jira] [Updated] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-8565:

Summary: SimpleQueryParser to support field filtering (aka Add field:text operator)
(was: SimpleQueryString to support field filtering (aka Add field:text operator))
[jira] [Updated] (LUCENE-8565) SimpleQueryParser to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-8565:

Description:
SimpleQueryParser lacks support for the `field:` operator for creating queries which operate on fields other than the default field. Seems like one can either get the parsed query to operate on a single field, or on ALL defined fields (+ weights). No support for specifying `field:value` in the query. It probably wasn't forgotten, but rather left out for simplicity, but since we are using this QP implementation more and more (mostly through Elasticsearch) we thought it would be useful to have it in. Seems like this is not too hard to pull off and I'll be happy to contribute a patch for it.

was:
SimpleQueryString lacks support for the `field:` operator for creating queries which operate on fields other than the default field. Seems like one can either get the parsed query to operate on a single field, or on ALL defined fields (+ weights). No support for specifying `field:value` in the query. It probably wasn't forgotten, but rather left out for simplicity, but since we are using this QP implementation more and more (mostly through Elasticsearch) we thought it would be useful to have it in. Seems like this is not too hard to pull off and I'll be happy to contribute a patch for it.
[jira] [Commented] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686301#comment-16686301 ]

Itamar Syn-Hershko commented on LUCENE-8565:

PR submitted on GitHub: https://github.com/apache/lucene-solr/pull/498 - reviews appreciated.
[jira] [Updated] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)
[ https://issues.apache.org/jira/browse/LUCENE-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-8565:

Description:
SimpleQueryString lacks support for the `field:` operator for creating queries which operate on fields other than the default field. Seems like one can either get the parsed query to operate on a single field, or on ALL defined fields (+ weights). No support for specifying `field:value` in the query. It probably wasn't forgotten, but rather left out for simplicity, but since we are using this QP implementation more and more (mostly through Elasticsearch) we thought it would be useful to have it in. Seems like this is not too hard to pull off and I'll be happy to contribute a patch for it.

was:
SimpleQueryString lacks support for the `field:` operator for creating queries which operate on fields other than the default field. Seems like one can either get the parsed query to operate on a single field, or on ALL defined fields (+ weights). No support for specifying `field:value` in the query. It probably wasn't forgotten, but rather left out for simplicity, but since we are using this QP implementation more and more (mostly through Elasticsearch) we thought it would be Seems like this is not too hard to pull off and I'll be happy to contribute a patch for it.
[jira] [Created] (LUCENE-8565) SimpleQueryString to support field filtering (aka Add field:text operator)
Itamar Syn-Hershko created LUCENE-8565:

Summary: SimpleQueryString to support field filtering (aka Add field:text operator)
Key: LUCENE-8565
URL: https://issues.apache.org/jira/browse/LUCENE-8565
Project: Lucene - Core
Issue Type: Improvement
Components: core/queryparser
Reporter: Itamar Syn-Hershko

SimpleQueryString lacks support for the `field:` operator for creating queries which operate on fields other than the default field. Seems like one can either get the parsed query to operate on a single field, or on ALL defined fields (+ weights). No support for specifying `field:value` in the query.

It probably wasn't forgotten, but rather left out for simplicity, but since we are using this QP implementation more and more (mostly through Elasticsearch) we thought it would be

Seems like this is not too hard to pull off and I'll be happy to contribute a patch for it.
[jira] [Commented] (LUCENE-6302) Adding Date Math support to Lucene Expressions module
[ https://issues.apache.org/jira/browse/LUCENE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338677#comment-14338677 ]

Itamar Syn-Hershko commented on LUCENE-6302:

Sent a PR for the latter: https://github.com/apache/lucene-solr/pull/129

Adding Date Math support to Lucene Expressions module
-

Key: LUCENE-6302
URL: https://issues.apache.org/jira/browse/LUCENE-6302
Project: Lucene - Core
Issue Type: Improvement
Components: modules/expressions
Affects Versions: 4.10.3
Reporter: Itamar Syn-Hershko

Lucene Expressions are great, but they don't allow for date math. More specifically, they don't allow inferring date parts from a numeric representation of a date stamp, nor do they allow parsing string representations into dates. Some of the features requested here are easy to implement via a ValueSource implementation (and potentially minor changes to the lexer definition); some are more involved. I'll be happy if we could get half of those in, and will be happy to work on a PR for the parts we can agree on.

The items we would be happy to have:
- A now() function (with or without TZ support) to return the current date/time value as a numeric long, which we could use against indexed datetime fields (which are in fact numerics).
- Parsing methods - to allow expressing datetimes as strings, and/or reading them from stored fields and parsing them from there. Parse errors would render a value of zero.
- Given a numeric value, allow specifying that it is a date value and then inferring date parts - e.g. Date(1424963520).Year == 2015, or Date(now()).Year - Date(1424963520).Year. Basically, methods which return numerics but internally create and use Date objects.
[jira] [Commented] (LUCENE-6302) Adding Date Math support to Lucene Expressions module
[ https://issues.apache.org/jira/browse/LUCENE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338563#comment-14338563 ]

Itamar Syn-Hershko commented on LUCENE-6302:

I actually expected the main objection would be to adding date parsing methods :) Maybe it would make sense to explain the use cases this is trying to solve.

We are using Elasticsearch Kibana, and since the latest version switched to using Lucene Expressions (from Groovy), we found ourselves blocked by what we can do with Kibana's scripted fields. For example, given a user's DOB, how can we do aggregations on their age? Or compute how many years (or days) have passed between two given dates? Yes, we can subtract the epochs (except that it doesn't seem to work: https://github.com/elasticsearch/elasticsearch/issues/9884), but translating the result into days, hours or years is even uglier using an expression.

I think introducing ValueSources to do this should be enough, but if changing the lexer is the preferred way I can go and do that as well. With regards to syntax - I'm not locked on any preferred syntax. Either way, it seems like adding a now() function is the easiest change, and I can send a PR with this change alone to start with.
[jira] [Created] (LUCENE-6302) Adding Date Math support to Lucene Expressions module
Itamar Syn-Hershko created LUCENE-6302:

Summary: Adding Date Math support to Lucene Expressions module
Key: LUCENE-6302
URL: https://issues.apache.org/jira/browse/LUCENE-6302
Project: Lucene - Core
Issue Type: Improvement
Components: modules/expressions
Affects Versions: 4.10.3
Reporter: Itamar Syn-Hershko

Lucene Expressions are great, but they don't allow for date math. More specifically, they don't allow inferring date parts from a numeric representation of a date stamp, nor do they allow parsing string representations into dates. Some of the features requested here are easy to implement via a ValueSource implementation (and potentially minor changes to the lexer definition); some are more involved. I'll be happy if we could get half of those in, and will be happy to work on a PR for the parts we can agree on.

The items we would be happy to have:
- A now() function (with or without TZ support) to return the current date/time value as a numeric long, which we could use against indexed datetime fields (which are in fact numerics).
- Parsing methods - to allow expressing datetimes as strings, and/or reading them from stored fields and parsing them from there. Parse errors would render a value of zero.
- Given a numeric value, allow specifying that it is a date value and then inferring date parts - e.g. Date(1424963520).Year == 2015, or Date(now()).Year - Date(1424963520).Year. Basically, methods which return numerics but internally create and use Date objects.
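The Date(1424963520).Year == 2015 example from the description can be sanity-checked with plain java.time; a ValueSource implementation of the proposed functions would wrap essentially this computation. The yearOf helper name and the choice of UTC are assumptions for illustration, not part of the proposal:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Sketch of the arithmetic the proposed expression functions would perform.
// now() corresponds to the proposed now() function; yearOf(x) corresponds to
// the hypothetical Date(x).Year accessor.
public class DateMathDemo {
    // Extract the year date-part from an epoch-seconds value (UTC assumed).
    static int yearOf(long epochSeconds) {
        return ZonedDateTime
                .ofInstant(Instant.ofEpochSecond(epochSeconds), ZoneOffset.UTC)
                .getYear();
    }

    static long now() {
        return Instant.now().getEpochSecond();
    }

    public static void main(String[] args) {
        // The JIRA example: Date(1424963520).Year == 2015
        System.out.println(yearOf(1424963520L)); // prints 2015
        // The "age in years" style arithmetic the Kibana use case needs:
        System.out.println(yearOf(now()) - yearOf(1424963520L));
    }
}
```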
Re: FSDirectory and creating directory
Thanks guys, we will mimic the current behavior and ignore the comment. Mike, I did promise to find bugs!

--
Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Wed, Feb 4, 2015 at 11:20 AM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Mike,
>
> This is why I ask here! So I think we should fix this before the release of 5.0! Maybe Robert has an explanation why he does the createDirectories() in the ctor. In any case I will now commit the removal of the bogus comment in the 4.10 branch.
>
> Uwe
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Wednesday, February 04, 2015 10:07 AM
> To: Lucene/Solr dev
> Cc: d...@lucenenet.apache.org
> Subject: Re: FSDirectory and creating directory
>
> In the past we considered this (mkdir when creating FSDir) a bug:
> https://issues.apache.org/jira/browse/LUCENE-1464
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Wed, Feb 4, 2015 at 4:03 AM, Uwe Schindler <uschind...@apache.org> wrote:
>
> Hi,
>
> on the Lucene.NET mailing list there were some issues with porting over Lucene 4.8's FSDirectory class to .NET. In fact, the following comment on a method caused confusion:
>
>   // returns the canonical version of the directory, creating it if it doesn't exist.
>   private static File getCanonicalPath(File file) throws IOException {
>     return new File(file.getCanonicalPath());
>   }
>
> In fact, the comment is not correct (and the whole method is useless - one could call file.getCanonicalFile() to do the same). According to the Javadocs and my tests, this method does *not* create the directory. If the directory does not exist, it just returns a synthetic canonical name (modifying only known parts of the path). We should maybe fix this comment and remove this method in 4.10.x (if we get a further bugfix release). We also have a test that validates that a directory is not created by FSDirectory's ctor (a side effect of some IndexWriter test).
>
> Nevertheless, in Lucene 5 we changed the behavior of the FSDirectory ctor with NIO.2:
>
>   protected FSDirectory(Path path, LockFactory lockFactory) throws IOException {
>     super(lockFactory);
>     Files.createDirectories(path); // create directory, if it doesn't exist
>     directory = path.toRealPath();
>   }
>
> The question is now: do we really intend to create the directory in Lucene 5? What about opening an IndexReader on a non-existent directory on a read-only filesystem? I know that Robert added this to make path.toRealPath() work correctly. I just want to discuss this before we release 5.0. To me it sounds wrong to create the directory in the constructor...
>
> Uwe
> -
> Uwe Schindler
> uschind...@apache.org
> Apache Lucene PMC Member / Committer
> Bremen, Germany
> http://lucene.apache.org/
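The behavior under discussion can be demonstrated with plain NIO.2, no Lucene required: Path.toRealPath() fails on a directory that does not exist yet, which is presumably why the 5.0 constructor calls Files.createDirectories(path) first. The class and method names below are just for this demo:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Demonstrates why resolving a real path requires the directory to exist:
// toRealPath() resolves against the actual filesystem and throws
// NoSuchFileException (an IOException) when the path is missing, unlike the
// old File.getCanonicalFile(), which happily fabricates a canonical name.
public class RealPathDemo {
    static boolean realPathFails(Path p) {
        try {
            p.toRealPath();
            return false;
        } catch (IOException e) { // NoSuchFileException for a missing path
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path missing = Path.of("no-such-dir-xyz-12345");
        System.out.println(realPathFails(missing)); // true: nothing to resolve

        Path created = Files.createTempDirectory("fsdir-demo");
        System.out.println(realPathFails(created)); // false: resolves fine
    }
}
```

This is exactly the trade-off in the thread: create the directory eagerly so toRealPath() succeeds, or tolerate the failure case (e.g. a read-only filesystem) and resolve lazily.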
Re: FSDirectory and creating directory
Rob, what is the intended behavior, and what is the reasoning behind it? Doesn't this affect only attempts to open a non-existent index directory - and whether or not there will be an empty folder left behind?

--
Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Wed, Feb 4, 2015 at 2:45 PM, Robert Muir <rcm...@gmail.com> wrote:

> Personally, I am completely against changing this for 5.0. This is the worst possible thing you can do; it will trickle into more bugs in lockfactory etc. Please don't make this last-minute risky change. It has no benefits and will only cause bugs.
>
> On Wed, Feb 4, 2015 at 7:44 AM, Robert Muir <rcm...@gmail.com> wrote:
>
> On Wed, Feb 4, 2015 at 4:03 AM, Uwe Schindler <uschind...@apache.org> wrote:
>
> > The question is now: do we really intend to create the directory in Lucene 5? What about opening an IndexReader on a non-existent directory on a read-only filesystem? I know that Robert added this to make path.toRealPath() work correctly. I just want to discuss this before we release 5.0. To me it sounds wrong to create the directory in the constructor...
>
> Please don't call this a bug until you understand why the change was made. Please read the behavior of getCanonicalPath and understand exactly why and how it fails: it's this nonexistent case.
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241214#comment-14241214 ]

Itamar Syn-Hershko commented on LUCENE-6103:

Maybe out of scope for this ticket, but how do we go about #2? Will be happy to take this discussion offline as well.

StandardTokenizer doesn't tokenize word:word
-

Key: LUCENE-6103
URL: https://issues.apache.org/jira/browse/LUCENE-6103
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 4.9
Reporter: Itamar Syn-Hershko
Assignee: Steve Rowe

StandardTokenizer (and as a result most default analyzers) will not tokenize word:word, and will preserve it as one token. This can be easily seen using Elasticsearch's analyze API:

localhost:9200/_analyze?tokenizer=standard&text=word%20word:word

If this is the intended behavior, then why? I can't really see the logic behind it. If not, I'll be happy to join the effort of fixing this.
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241306#comment-14241306 ]

Itamar Syn-Hershko commented on LUCENE-6103:

Sent them a request. I'll buy Robert beers if that could help push this forward!
[jira] [Created] (LUCENE-6103) StandardTokenizer doesn't tokenizer word:word
Itamar Syn-Hershko created LUCENE-6103:

Summary: StandardTokenizer doesn't tokenizer word:word
Key: LUCENE-6103
URL: https://issues.apache.org/jira/browse/LUCENE-6103
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 4.9
Reporter: Itamar Syn-Hershko

StandardTokenizer (and as a result most default analyzers) will not tokenize word:word, and will preserve it as one token. This can be easily seen using Elasticsearch's analyze API:

localhost:9200/_analyze?tokenizer=standard&text=word%20word:word

If this is the intended behavior, then why? I can't really see the logic behind it. If not, I'll be happy to join the effort of fixing this.
[jira] [Updated] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Itamar Syn-Hershko updated LUCENE-6103:

Summary: StandardTokenizer doesn't tokenize word:word
(was: StandardTokenizer doesn't tokenizer word:word)
[jira] [Commented] (LUCENE-5997) StandardFilter redundant
[ https://issues.apache.org/jira/browse/LUCENE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239697#comment-14239697 ]

Itamar Syn-Hershko commented on LUCENE-5997:

Sounds good!

StandardFilter redundant
-

Key: LUCENE-5997
URL: https://issues.apache.org/jira/browse/LUCENE-5997
Project: Lucene - Core
Issue Type: Bug
Components: core/search
Affects Versions: 4.10.1
Reporter: Itamar Syn-Hershko
Priority: Trivial

Any reason why StandardFilter is still around? It's just a no-op class now:

  @Override
  public final boolean incrementToken() throws IOException {
    return input.incrementToken(); // TODO: add some niceties for the new grammar
  }

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java
[jira] [Commented] (LUCENE-5723) Performance improvements for FastCharStream
[ https://issues.apache.org/jira/browse/LUCENE-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239728#comment-14239728 ]

Itamar Syn-Hershko commented on LUCENE-5723:

Reported as https://java.net/jira/browse/JAVACC-285

Performance improvements for FastCharStream
---

Key: LUCENE-5723
URL: https://issues.apache.org/jira/browse/LUCENE-5723
Project: Lucene - Core
Issue Type: Improvement
Components: core/queryparser
Reporter: Itamar Syn-Hershko
Priority: Minor

Hello from the .NET land,

A user of ours has identified an optimization opportunity; although minor, I think it points to a valid principle - we should avoid using exceptions to control flow when possible. Here are the original ticket and the commits to our codebase. If this looks valid to you too, I can go ahead and prepare a PR.

https://issues.apache.org/jira/browse/LUCENENET-541
https://github.com/apache/lucene.net/commit/ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
https://git-wip-us.apache.org/repos/asf?p=lucenenet.git;a=blobdiff;f=src/core/QueryParser/QueryParserTokenManager.cs;h=ec09c8e451f7a7d1572fbdce4c7598e362526a7c;hp=17583d20f660fdb6e4aa86105c7574383f965ebe;hb=41ebbc2d;hpb=ac8c9fa809110ddb180bf7b2ce93e86270b39ff6
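The shape of the optimization is easy to show in miniature. FastCharStream's readChar() has historically signalled end-of-input by throwing IOException, which the generated token manager catches on every token; the alternative is an in-band sentinel check, which Reader.read() already provides. The class below is an illustrative sketch, not the actual Lucene/Lucene.NET code:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Exception-free end-of-input handling: Reader.read() returns -1 at EOF,
// so the hot loop tests a sentinel instead of constructing and catching an
// exception per stream (the cost the JAVACC-285 report is about).
public class SentinelReadDemo {
    static final int EOF = -1;

    // Count characters using the return-value sentinel; no exception is
    // ever thrown on the (extremely common) end-of-input path.
    static int countChars(String s) {
        try (Reader in = new StringReader(s)) {
            int n = 0;
            while (in.read() != EOF) {
                n++;
            }
            return n;
        } catch (IOException e) {
            // StringReader only throws if used after close; unreachable here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countChars("abc")); // prints 3
    }
}
```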
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239784#comment-14239784 ] Itamar Syn-Hershko commented on LUCENE-6103: Yes, I figured it would come down to some Unicode rules. Can you give a rationale for this, mainly out of curiosity? I'm not a Unicode expert, but I'd assume that just like you wouldn't want English words to not break on Hebrew Punctuation Gershayim (e.g. TestWord is actually 2 tokens and מנכלים is one), maybe this rule is meant for specific scenarios and not for the general use case? On another note, any type of Gershayim should be preserved within Hebrew words, not only U+05F4. This is mainly because the keyboards and editors in use produce the standard character in most cases. I had a chat with Robert a while back where he said that's the case, I'm just making sure you didn't follow the specs to the letter in that regard... StandardTokenizer doesn't tokenize word:word Key: LUCENE-6103 URL: https://issues.apache.org/jira/browse/LUCENE-6103 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 4.9 Reporter: Itamar Syn-Hershko Assignee: Steve Rowe StandardTokenizer (and as a result most default analyzers) will not tokenize word:word and will preserve it as one token. This can be easily seen using Elasticsearch's analyze API: localhost:9200/_analyze?tokenizer=standard&text=word%20word:word If this is the intended behavior, then why? I can't really see the logic behind it. If not, I'll be happy to join in the effort of fixing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240090#comment-14240090 ] Itamar Syn-Hershko commented on LUCENE-6103: Good stuff, thanks Steve. I'm going to see if the rest of the UAX is good for us, and if so see if I can comply with the 6.2.5 version of the specs. It's a good thing StandardTokenizer is no longer English-centric, but I cannot imagine what use the colon has, especially since in most cases it is not something reasonable :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133 ] Itamar Syn-Hershko commented on LUCENE-6103: Ok, so I did some homework. In Swedish, the colon is a way to shorten words. So c:a is in fact cirka, which means approximately. I guess it can be thought of like English acronyms, only apparently it's way less commonly used in Swedish (my source says very, very seldom used; old style and not used in modern Swedish at all). Not only is it hardly used, apparently it's only legal in 3-letter combinations (c:a but not c:ka). And also, the effects it has are quite severe at the moment - 2 words with a colon in between and no space will be output as one token even though it's 100% not applicable to Swedish, since each word has more than 2 characters. I'm not aiming to change the Unicode standards, that's way beyond my limited powers, but: 1. Given the above, does it really make sense to use this tokenizer in all language-specific analyzers as well? e.g. https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105 I'd think for language-specific analyzers we'd want tokenizers aimed at that language with limited support for others. So, in this case, the colon would always be considered a tokenizing char. 2. Can we change the JFlex definition to at least limit the effects of this, e.g. only support colon as MidLetter if the overall token length == 3, so c:a is a valid token and word:word is not?
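The heuristic in point 2 can be sketched with plain JDK string handling (illustrative only - the real change would live in the JFlex grammar, and the names here are made up): treat the colon as word-internal only when the whole candidate token is three characters long, as in the Swedish abbreviation c:a.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed heuristic, not the actual StandardTokenizer grammar:
// keep ':' inside a token only for 3-character "c:a" style abbreviations,
// and treat it as a token break everywhere else.
public class ColonHeuristicDemo {
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String candidate : text.split("\\s+")) {
            if (candidate.isEmpty()) continue;
            if (candidate.length() == 3 && candidate.charAt(1) == ':') {
                // Swedish-style abbreviation such as "c:a" stays whole
                out.add(candidate);
            } else {
                // everything else splits on the colon, so "word:word" becomes two tokens
                out.addAll(Arrays.asList(candidate.split(":")));
            }
        }
        return out;
    }
}
```

Under this rule c:a survives as one token while word:word is split, which is the behavior asked for above.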
[jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240392#comment-14240392 ] Itamar Syn-Hershko commented on LUCENE-6103: 0. You mean it implements UAX#29 version 6.3 :) 1. I'll likely be sending a PR for #1 sometime soon. Would you recommend using UAX#29 minus specific non-English tweaks, or falling back to ClassicTokenizer which is English-specific, or something else? 2. Here's the thing: the standard is wrong, or buggy. Ask any Swede and they will tell you, and any non-Swedish corpus wouldn't care. And basically this is a bug in every Lucene-based system today because of the word:word scenario; it's a bit of an edge case, but I bet I can find multiple occurrences in every big enough system. What can we do about that? We already solved this using char filters, converting colons to commas. It feels a bit hacky though, and again - this _is_ a flaw in Lucene's analysis even though it conforms to a Unicode standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
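The char-filter workaround mentioned in the comment can be sketched with plain JDK string handling (a simulation of the idea, not Lucene's actual MappingCharFilter class): rewrite colons to a breaking character before tokenization so the word-break rules never see them.

```java
// Simulation of the char-filter workaround described above, using only the
// JDK: map ':' to a space ahead of tokenization so "word:word" can never be
// glued into a single token by the MidLetter rule.
public class ColonCharFilterDemo {
    // stand-in for a pre-tokenization char filter
    static String filter(String text) {
        return text.replace(':', ' ');
    }

    // stand-in for a whitespace-driven tokenizer running after the filter
    static String[] tokenize(String text) {
        return filter(text).trim().split("\\s+");
    }
}
```

The trade-off noted above applies here too: the mapping also destroys legitimate colon-internal tokens such as c:a, which is why it feels hacky.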
JFlex, tokenization, and custom token exceptions
Hey all, I posted this question also to the JFlex[1] list as it seems a more appropriate place, but I thought I should raise this here as well. I'm looking for ways to use Lucene's tokenizers but preserve some custom tokens defined by the user. For example, use StandardTokenizer but preserve C++, C# and i-phone as whole tokens. The gotcha here is I want that list to be loaded at runtime, and not compiled into the tokenizer - mainly because it will change over time. The problem is there's no real way of doing this currently. I would have implemented this myself, but JFlex doesn't seem to support it (other than defining new macros and regenerating the Java pieces, recompiling etc). I discussed this with Rob Muir a couple of months back and he seemed interested; I will be happy to see if there's interest in pursuing this, or get any new ideas on how to enable this more easily on the JFlex layer or otherwise. I'll be happy to take this on, but every approach I'm looking at currently has some significant flaws. Cheers, [1] http://sourceforge.net/p/jflex/mailman/jflex-users/?viewmonth=201411 -- Itamar Syn-Hershko http://code972.com | @synhershko https://twitter.com/synhershko Freelance Developer Consultant Author of RavenDB in Action http://manning.com/synhershko/
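One possible approach to the runtime-loaded token list (a sketch under assumptions - this is not an existing Lucene or JFlex facility, and all names are illustrative): swap the protected tokens for opaque placeholders before tokenization, then restore them afterwards. It has one of the flaws alluded to above, namely that a placeholder could collide with real input text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical workaround, not a Lucene API: protect runtime-defined tokens
// like "C++" by replacing them with placeholders a word-break tokenizer will
// keep whole, then mapping the placeholders back after tokenization.
public class ProtectedTokensDemo {
    static List<String> tokenize(String text, List<String> protectedTokens) {
        Map<String, String> placeholders = new HashMap<>();
        int i = 0;
        for (String t : protectedTokens) {
            // NOTE: a real implementation must guarantee the placeholder
            // cannot occur in the input; this sketch does not.
            String key = "PROTECTED" + (i++);
            placeholders.put(key, t);
            text = text.replace(t, key);
        }
        List<String> out = new ArrayList<>();
        for (String tok : text.split("[^A-Za-z0-9]+")) { // stand-in for StandardTokenizer
            if (tok.isEmpty()) continue;
            out.add(placeholders.getOrDefault(tok, tok));
        }
        return out;
    }
}
```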
[jira] [Created] (LUCENE-5997) StandardFilter redundant
Itamar Syn-Hershko created LUCENE-5997: -- Summary: StandardFilter redundant Key: LUCENE-5997 URL: https://issues.apache.org/jira/browse/LUCENE-5997 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.10.1 Reporter: Itamar Syn-Hershko Priority: Trivial Any reason why StandardFilter is still around? It's just a no-op class now: @Override public final boolean incrementToken() throws IOException { return input.incrementToken(); // TODO: add some niceties for the new grammar } https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (LUCENE-4601) ivy availability check isn't quite right
[ https://issues.apache.org/jira/browse/LUCENE-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035885#comment-14035885 ] Itamar Syn-Hershko commented on LUCENE-4601: May not be directly related, but I just tried running this: http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ on OSX Mavericks, with ant and ivy both installed via homebrew. Ivy was not found by ant and idea even when I placed a manually downloaded jar locally myself. I had to run ivy-bootstrap to get off the ground - maybe it's worth adding that to the docs. ivy availability check isn't quite right Key: LUCENE-4601 URL: https://issues.apache.org/jira/browse/LUCENE-4601 Project: Lucene - Core Issue Type: Bug Components: general/build Reporter: Robert Muir Fix For: 4.1, 5.0 Attachments: LUCENE-4601.patch remove ivy from your .ant/lib but load it up on a build file like so: You have to lie to lucene's build, overriding ivy.available, because for some reason the detection is wrong and will tell you ivy is not available, when it actually is. I tried changing the detector to use available classname=some.ivy.class and this didn't work either... so I don't actually know what the correct fix is. {noformat} <project name="test" default="test" basedir="."> <path id="ivy.lib.path"> <fileset dir="/Users/rmuir" includes="ivy-2.2.0.jar" /> </path> <taskdef resource="org/apache/ivy/ant/antlib.xml" uri="antlib:org.apache.ivy.ant" classpathref="ivy.lib.path" /> <target name="test"> <subant target="test" inheritAll="false" inheritRefs="false" failonerror="true"> <fileset dir="lucene-trunk/lucene" includes="build.xml"/> <!-- lie --> <property name="ivy.available" value="true"/> </subant> </target> </project> {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035978#comment-14035978 ] Itamar Syn-Hershko commented on LUCENE-2841: Can anyone review and comment? CommonGramsFilter improvements -- Key: LUCENE-2841 URL: https://issues.apache.org/jira/browse/LUCENE-2841 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 3.1, 4.0-ALPHA Reporter: Steve Rowe Priority: Minor Fix For: 4.9, 5.0 Attachments: commit-6402a55.patch Currently CommonGramsFilter expects users to remove the common words around which output token ngrams are formed, by appending a StopFilter to the analysis pipeline. This is inefficient in two ways: captureState() is called on (trailing) stopwords, and then the whole stream has to be re-examined by the following StopFilter. The current ctor should be deprecated, and another ctor added with a boolean option controlling whether the common words should be output as unigrams. If common words *are* configured to be output as unigrams, captureState() will still need to be called, as it is now. If the common words are *not* configured to be output as unigrams, rather than calling captureState() for the trailing token in each output token ngram, the term text, position and offset can be maintained in the same way as they are now for the leading token: using a System.arraycopy()'d term buffer and a few ints for positionIncrement and offset. The user then no longer would need to append a StopFilter to the analysis chain. An example illustrating both possibilities should also be added. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (LUCENE-5723) Performance improvements for FastCharStream
Itamar Syn-Hershko created LUCENE-5723: -- Summary: Performance improvements for FastCharStream Key: LUCENE-5723 URL: https://issues.apache.org/jira/browse/LUCENE-5723 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Itamar Syn-Hershko Priority: Minor Hello from the .NET land, A user of ours has identified an optimization opportunity; although minor, I think it points to a valid principle - we should avoid using exceptions for controlling flow when possible. Here's the original ticket + commits to our codebase. If this looks valid to you too, I can go ahead and prepare a PR. https://issues.apache.org/jira/browse/LUCENENET-541 https://github.com/apache/lucene.net/commit/ac8c9fa809110ddb180bf7b2ce93e86270b39ff6 https://git-wip-us.apache.org/repos/asf?p=lucenenet.git;a=blobdiff;f=src/core/QueryParser/QueryParserTokenManager.cs;h=ec09c8e451f7a7d1572fbdce4c7598e362526a7c;hp=17583d20f660fdb6e4aa86105c7574383f965ebe;hb=41ebbc2d;hpb=ac8c9fa809110ddb180bf7b2ce93e86270b39ff6 -- This message was sent by Atlassian JIRA (v6.2#6252)
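The principle behind the proposed change - preferring a sentinel return value over an exception on the end-of-input path - can be sketched with plain JDK readers (an illustration of the idea only, not the actual FastCharStream patch):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustration of avoiding exceptions for flow control: signal end-of-input
// with a sentinel value instead of constructing and unwinding an exception
// every time the stream is exhausted.
public class SentinelEofDemo {
    static final int EOF = -1;

    // Exception-based style: the caller learns about EOF by paying for an
    // IOException on what is really an ordinary, expected condition.
    static char readOrThrow(Reader r) throws IOException {
        int c = r.read();
        if (c == EOF) throw new IOException("read past eof");
        return (char) c;
    }

    // Sentinel-based style: EOF is a normal return value the caller checks;
    // java.io.Reader already returns -1 at end of stream.
    static int readOrSentinel(Reader r) throws IOException {
        return r.read();
    }

    // Consume a string with the sentinel style; no exception is ever raised
    // on the normal end-of-input path.
    static int countChars(String s) {
        try {
            Reader r = new StringReader(s);
            int n = 0;
            while (readOrSentinel(r) != EOF) n++;
            return n;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```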
Re: ICUFoldingFilter obsolete?
This makes sense, thanks Rob -- Itamar Syn-Hershko http://code972.com | @synhershko https://twitter.com/synhershko Freelance Developer Consultant Author of RavenDB in Action http://manning.com/synhershko/ On Sun, Mar 2, 2014 at 3:54 PM, Robert Muir rcm...@gmail.com wrote: I use it too, it's fine. It's just not really standardized, and never was :) that UTR had that status when I wrote it! On Sun, Mar 2, 2014 at 8:52 AM, Shawn Heisey s...@elyograg.org wrote: On 3/2/2014 6:37 AM, Robert Muir wrote: It was always this way. I don't think such kinds of normalization should be standards either (what this stuff is doing is heuristical in nature). I use ICUFoldingFilterFactory in my Solr schema, with the idea that it's a smart and single-pass way to fold and lowercase. Is support from IBM and Lucene expected to continue, or should I be looking for another solution? Thanks, Shawn
ICUFoldingFilter obsolete?
Hi all, I may have missed the train on this, but what is the status of ICUFoldingFilter? Documentation suggests it follows foldings specified in UTR#30 ( http://lucene.apache.org/core/4_6_1/analyzers-icu/org/apache/lucene/analysis/icu/ICUFoldingFilter.html), but UTR#30 is a draft that was later withdrawn ( http://www.unicode.org/reports/tr30/). I'm not up to date with the latest and greatest in the Unicode world so I'm not sure why it was withdrawn, but given the delicacy of term normalization I suppose this is worth revisiting? Thanks, -- Itamar Syn-Hershko http://code972.com | @synhershko https://twitter.com/synhershko Freelance Developer Consultant Author of RavenDB in Action http://manning.com/synhershko/
[jira] [Created] (LUCENE-5358) Code cleanup on KStemmer
Itamar Syn-Hershko created LUCENE-5358: -- Summary: Code cleanup on KStemmer Key: LUCENE-5358 URL: https://issues.apache.org/jira/browse/LUCENE-5358 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.6, 4.5.1, 4.5, 3.0 Reporter: Itamar Syn-Hershko Priority: Minor This affects all versions with KStemmer in them The code of KStemmer needs some intensive cleanup, just to give you some idea on something that immediately popped up: https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/KStemmer.java#L283-286 I'll be happy to do this myself, just wanted to check in advance to see if this is something you'd consider accepting in -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5011) MemoryIndex and FVH don't play along with multi-value fields
Itamar Syn-Hershko created LUCENE-5011: -- Summary: MemoryIndex and FVH don't play along with multi-value fields Key: LUCENE-5011 URL: https://issues.apache.org/jira/browse/LUCENE-5011 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Itamar Syn-Hershko When multi-value fields are indexed to a MemoryIndex, positions are computed correctly on search but the start and end offsets and the values array index aren't correct. Comparing the same execution path for IndexReader on a Directory impl and MemoryIndex (same document, same query, same analyzer, different Index impl), the difference first shows in FieldTermStack.java line 125: termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), pos, weight ) ); dpEnum.startOffset() and dpEnum.endOffset don't match between implementations. This looks like a bug in MemoryIndex, which doesn't seem to handle tokenized multi-value fields all too well when positions and offsets are required. I should also mention we are using an Analyzer which outputs several tokens at a position (a la SynonymFilter), but I don't believe this is related. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5011) MemoryIndex and FVH don't play along with multi-value fields
[ https://issues.apache.org/jira/browse/LUCENE-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662950#comment-13662950 ] Itamar Syn-Hershko commented on LUCENE-5011: The actual test case we have now is very tightly coupled with ElasticSearch and our custom analysis chain; it may take me some time to decouple it into a stand-alone Lucene test. Alternatively, I'll be happy to work this out with you via Skype using our existing test case. -- This message is automatically generated by JIRA.
[jira] [Created] (LUCENE-4673) TermQuery.toString() doesn't play nicely with whitespace
Itamar Syn-Hershko created LUCENE-4673: -- Summary: TermQuery.toString() doesn't play nicely with whitespace Key: LUCENE-4673 URL: https://issues.apache.org/jira/browse/LUCENE-4673 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 3.6.2, 4.0-BETA, 4.1 Reporter: Itamar Syn-Hershko A TermQuery where term.text() contains whitespace outputs an incorrect string representation: field:foo bar instead of field:"foo bar". A correct representation is one that could be parsed again into the correct Query object (using the correct analyzer, yes, but still). This may not be so critical, but in our system we use Lucene's QP to parse and then pre-process and optimize user queries. To do that we use Query.toString on some clauses to rebuild the query string. This can be easily resolved by always adding quote marks before and after the term text in TermQuery.toString. Testing to see if they are required or not is too much work, and TermQuery is ignorant of quote marks anyway. Some other scenarios which could benefit from this change are places where escaped characters are used, such as URLs. -- This message is automatically generated by JIRA.
[jira] [Commented] (LUCENE-4673) TermQuery.toString() doesn't play nicely with whitespace
[ https://issues.apache.org/jira/browse/LUCENE-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548874#comment-13548874 ] Itamar Syn-Hershko commented on LUCENE-4673: I figured as much, yet we would definitely like to have this behavior built-in. Are there any plans for such an interface to perform a proper Query -> String conversion? -- This message is automatically generated by JIRA.
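The always-quote fix proposed in the report can be sketched as follows (illustrative only; this is not the actual TermQuery.toString implementation, and the method name is made up):

```java
// Sketch of the proposed behavior: always wrap the term text in quote marks
// so a term containing whitespace round-trips unambiguously through the
// query parser, instead of printing field:foo bar.
public class TermToStringDemo {
    static String termToString(String field, String termText) {
        return field + ":\"" + termText + "\"";
    }
}
```

As the report notes, testing whether the quotes are actually required would be extra work for no benefit, since unconditional quoting is always parseable.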
[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements
[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539310#comment-13539310 ] Itamar Syn-Hershko commented on LUCENE-2841: Attached is a patch to fix this, including tests. There is no regression, and the new behavior when keepOrig is set to true is as described in the comments here. The only thing I wasn't sure about was CommonGramsQueryFilter - should it be deprecated, or how should it be made to work with this change? -- This message is automatically generated by JIRA.
Re: pro coding style
In the past git had bad tooling; that is not the case today. I've been using git also without github screens - and while they definitely add a lot, it is still ten times more usable than SVN. As I told the Lucene.NET mailing list, you should all watch the following video and give git a few days of your time before continuing with this discussion: http://www.youtube.com/watch?v=4XpnKHJAok8 Also, Apache mirrors to github, so basically you work against github all the time. On Fri, Nov 30, 2012 at 4:15 PM, Robert Muir rcm...@gmail.com wrote: On Fri, Nov 30, 2012 at 9:10 AM, Mark Miller markrmil...@gmail.com wrote: On Nov 30, 2012, at 8:56 AM, Robert Muir rcm...@gmail.com wrote: but git by itself, is pretty unusable. Given the number of committers that eat some pain to use git when developing lucene/solr, and have no github or pull requests, I'm not sure that's a common thought :) Sure, some people might disagree with me. I'm more than willing to eat some pain if it makes contributions easier. I just feel like a lot of what makes github successful is unfortunately actually in github and not git. It's like if your development team is screaming for linux machines. You have to be careful how to interpret that. If you hand them a bunch of machines with just linux kernels, they probably won't be productive. When they scream for linux they want a userland with a shell, compiler, X-windows, editor and so on too.
[jira] [Commented] (LUCENE-4208) Spatial distance relevancy should use score of 1/distance
[ https://issues.apache.org/jira/browse/LUCENE-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451430#comment-13451430 ] Itamar Syn-Hershko commented on LUCENE-4208: What's the status of this? Are query results being properly sorted based on distance? Spatial distance relevancy should use score of 1/distance - Key: LUCENE-4208 URL: https://issues.apache.org/jira/browse/LUCENE-4208 Project: Lucene - Core Issue Type: New Feature Components: modules/spatial Reporter: David Smiley Fix For: 4.0 The SpatialStrategy.makeQuery() at the moment uses the distance as the score (although some strategies -- TwoDoubles if I recall might not do anything, which would be a bug). The distance is a poor value to use as the score because the score should be related to relevancy, and the distance itself is inversely related to that. A score of 1/distance would be nice. Another alternative is earthCircumference/2 - distance, although I like 1/distance better. Maybe use a different constant than 1. Credit: this is Chris Male's idea. -- This message is automatically generated by JIRA.
[jira] [Commented] (LUCENE-4186) Lucene spatial's distErrPct is treated as a fraction, not a percent.
[ https://issues.apache.org/jira/browse/LUCENE-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447037#comment-13447037 ] Itamar Syn-Hershko commented on LUCENE-4186: distErrPct makes sense to me - it makes more sense to talk about the expected error rate rather than the actual precision given. Hence the name Distance Error Percentage makes perfect sense, although it is tough to make an acronym of... And while at it, throw in a bug fix: SpatialArgs.toString should multiply distPrecision by 100, not divide it. Lucene spatial's distErrPct is treated as a fraction, not a percent. -- Key: LUCENE-4186 URL: https://issues.apache.org/jira/browse/LUCENE-4186 Project: Lucene - Core Issue Type: Bug Components: modules/spatial Reporter: David Smiley Assignee: David Smiley Priority: Critical Fix For: 4.0 The distance-error-percent of a query shape in Lucene spatial is, in a nutshell, the percent of the shape's area that is an error epsilon when considering search detail at its edges. The default is 2.5%, for reference. However, as configured, it is read in as a fraction: {code:xml} <fieldType name="location_2d_trie" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDetailDist="0.001" /> {code} -- This message is automatically generated by JIRA.
[jira] [Commented] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage
[ https://issues.apache.org/jira/browse/LUCENE-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445807#comment-13445807 ] Itamar Syn-Hershko commented on LUCENE-4342: I can confirm this is fixed now. Thanks for the fast turnaround! -- This message is automatically generated by JIRA.
[jira] [Created] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage
Itamar Syn-Hershko created LUCENE-4342: -- Summary: Issues with prefix tree's Distance Error Percentage Key: LUCENE-4342 URL: https://issues.apache.org/jira/browse/LUCENE-4342 Project: Lucene - Core Issue Type: Bug Components: modules/spatial Affects Versions: 4.0-BETA, 4.0-ALPHA Reporter: Itamar Syn-Hershko Attachments: unnamed.patch See attached patch for a failing test. Basically, it's a simple point-and-radius scenario that works great as long as args.setDistPrecision(0.0); is called. Once the default precision is used (2.5%), it doesn't work as expected. The distance between the 2 points in the patch is 35.75 KM. Taking into account the 2.5% error, the effective radius without false negatives/positives should be around 34.8 KM. This test fails with a radius of 33 KM.
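The arithmetic in the report can be checked directly. This is a sketch of the reporter's numbers only, not code from the attached patch: shrinking the 35.75 KM distance by the default 2.5% error fraction gives the ~34.8 KM effective radius mentioned above.

```java
// Sketch of the arithmetic in the bug report (not code from the patch):
// with the default distErrPct of 2.5%, the largest radius guaranteed to
// avoid false positives at the shape's edge shrinks by that fraction.
public class DistErrMath {
    public static void main(String[] args) {
        double distanceKm = 35.75;  // distance between the two test points
        double distErrPct = 0.025;  // default 2.5% error, as a fraction
        double effectiveRadiusKm = distanceKm * (1 - distErrPct);
        System.out.printf("%.2f KM%n", effectiveRadiusKm); // ~34.86 KM
    }
}
```

A 33 KM query radius sits well inside that bound, which is why the failing test was surprising.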
[jira] [Updated] (LUCENE-4342) Issues with prefix tree's Distance Error Percentage
[ https://issues.apache.org/jira/browse/LUCENE-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Itamar Syn-Hershko updated LUCENE-4342: --- Attachment: unnamed.patch A failing test Issues with prefix tree's Distance Error Percentage Key: LUCENE-4342 URL: https://issues.apache.org/jira/browse/LUCENE-4342 Project: Lucene - Core Issue Type: Bug Components: modules/spatial Affects Versions: 4.0-ALPHA, 4.0-BETA Reporter: Itamar Syn-Hershko Attachments: unnamed.patch See attached patch for a failing test. Basically, it's a simple point-and-radius scenario that works great as long as args.setDistPrecision(0.0); is called. Once the default precision is used (2.5%), it doesn't work as expected. The distance between the 2 points in the patch is 35.75 KM. Taking into account the 2.5% error, the effective radius without false negatives/positives should be around 34.8 KM. This test fails with a radius of 33 KM.
Re: svn commit: r1375282 - /incubator/lucene.net/trunk/src/core/Util/Parameter.cs
This will probably require releasing the core again as well as a new RC... The spatial module was updated, still doing some integration tests, will send more updates soon

On Tue, Aug 21, 2012 at 1:14 AM, synhers...@apache.org wrote:

Author: synhershko Date: Mon Aug 20 22:14:01 2012 New Revision: 1375282 URL: http://svn.apache.org/viewvc?rev=1375282&view=rev Log: Fixing a possible NRE which can be thrown during a race condition on accessing allParameters. This is not an air-tight solution, as an ArgumentException can still be thrown. I don't care much about doing this within a lock as it will never be a bottleneck. https://groups.google.com/group/ravendb/browse_thread/thread/a5cf07e80f70c856

Modified: incubator/lucene.net/trunk/src/core/Util/Parameter.cs
URL: http://svn.apache.org/viewvc/incubator/lucene.net/trunk/src/core/Util/Parameter.cs?rev=1375282&r1=1375281&r2=1375282&view=diff
==============================================================================
--- incubator/lucene.net/trunk/src/core/Util/Parameter.cs (original)
+++ incubator/lucene.net/trunk/src/core/Util/Parameter.cs Mon Aug 20 22:14:01 2012
@@ -39,11 +39,13 @@ namespace Lucene.Net.Util
             // typesafe enum pattern, no public constructor
             this.name = name;
             string key = MakeKey(name);
-
-            if (allParameters.ContainsKey(key))
-                throw new System.ArgumentException("Parameter name " + key + " already used!");
-
-            allParameters[key] = this;
+
+            lock (allParameters)
+            {
+                if (allParameters.ContainsKey(key))
+                    throw new System.ArgumentException("Parameter name " + key + " already used!");
+                allParameters[key] = this;
+            }
         }

         private string MakeKey(string name)
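The C# change in that commit is small enough to mirror in a self-contained Java sketch (names assumed, not the Lucene.NET source): the typesafe-enum constructor does a check-then-put on a shared map, and without a lock two threads registering at once can race.

```java
import java.util.HashMap;
import java.util.Map;

// Java sketch of the pattern the commit fixes (names assumed): each
// typesafe-enum instance registers itself in a shared map, so the
// check-then-put must be guarded by a lock to avoid the race.
class Parameter {
    private static final Map<String, Parameter> allParameters = new HashMap<>();
    private final String name;

    Parameter(String name) {
        this.name = name;
        String key = makeKey(name);
        synchronized (allParameters) { // guard the check-then-put race
            if (allParameters.containsKey(key))
                throw new IllegalArgumentException("Parameter name " + key + " already used!");
            allParameters.put(key, this);
        }
    }

    private String makeKey(String name) {
        return getClass().getName() + " " + name;
    }
}
```

As the commit message notes, the lock only removes the NRE/corruption risk; a duplicate name still throws, by design.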
Re: svn commit: r1375282 - /incubator/lucene.net/trunk/src/core/Util/Parameter.cs
That won't work, the Occur flags need to be statically and publicly available. Since the entire point of that Parameter class is to make the enum serializable, which is in fact the case with C# (while it is not in Java 5), I just removed it and made Occur a native enum again. All core tests pass (aside from 2 in TestOpenBitSet and TestWeakDictionaryBehavior, but they aren't related to this change). Commit details: http://svn.apache.org/viewvc?view=revision&revision=1375296 On Tue, Aug 21, 2012 at 1:21 AM, Oren Eini (Ayende Rahien) aye...@ayende.com wrote: Instead of doing it this way, do NOT create Occur using separate static fields. Merge Parameter into Occur (only used there) and create the entire dictionary once. Otherwise, you run the risk of the ArgumentException. If that happens, because this is raised from the static ctor, you'll have killed the entire app domain. On Tue, Aug 21, 2012 at 1:19 AM, Itamar Syn-Hershko ita...@code972.com wrote: This will probably require releasing the core again as well as a new RC... The spatial module was updated, still doing some integration tests, will send more updates soon On Tue, Aug 21, 2012 at 1:14 AM, synhers...@apache.org wrote: Author: synhershko Date: Mon Aug 20 22:14:01 2012 New Revision: 1375282 URL: http://svn.apache.org/viewvc?rev=1375282&view=rev Log: Fixing a possible NRE which can be thrown during a race condition on accessing allParameters. This is not an air-tight solution, as an ArgumentException can still be thrown. I don't care much about doing this within a lock as it will never be a bottleneck.
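For contrast, the native-enum replacement described in the reply above needs no registration map at all. A Java-flavored sketch follows (the actual commit is C#; names assumed): native enums serialize by constant name out of the box, which is the property the Parameter class existed to provide.

```java
// Sketch: Occur as a native enum instead of the typesafe-enum Parameter
// class. No shared map, no race, and serialization works by constant name.
enum Occur {
    MUST, SHOULD, MUST_NOT;

    public static void main(String[] args) {
        // valueOf/name round-trip replaces the hand-rolled key lookup
        System.out.println(Occur.valueOf("MUST"));
    }
}
```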
Re: Outstanding issues for 3.0.3
Nowadays git works just great on Windows, and it's much easier to work with than Hg On Wed, Aug 1, 2012 at 9:41 PM, Zachary Gramana zgram...@feature23.com wrote: On Aug 1, 2012, at 12:51 PM, Itamar Syn-Hershko wrote: And for heaven's sake, can we move to git when graduating? Given that we're a .NET-focused community, and many of us are likely primarily using Windows as both our primary development and deployment platforms, I'd suggest looking at Mercurial before committing to git. Either way, +1 for any DVCS.
Re: Outstanding issues for 3.0.3
The point is to make the code better, not to satisfy R# :) The main benefit of this process is marking fields as readonly, finding code paths with stupid behavior and moving simple aggregations to use LINQ. I don't apply the LINQ syntax to non-trivial operations, to make it easier to keep track of the Java version. My thoughts on the points you raised inline On Thu, Aug 2, 2012 at 6:53 PM, Zachary Gramana zgram...@gmail.com wrote: I would like to pitch into this effort and put my ReSharper license to use. I pulled down trunk, picked a yellow item at random, and started to dig in. I quickly generated more questions than answers, and realized I needed to stop munging code and consult the wiki and list archives. After digging through both, I'm still not entirely certain about what the style guidelines are for 3.x onward. I also noted this[1] discussion regarding some other guidelines, but I didn't see it make it beyond the proposal stage. [1] http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201112.mbox/%3ccajtrbsrdbzkocwln6d6ywhzn2fno91mko1acrp-pflx62du...@mail.gmail.com%3E Here are some of the things Re# is catching that I'm unsure of: 1) Usage of this prefix when not required. this.blah = blah; - required this. this.aBlah = blah; - optional this, which Re# doesn't like. I'm assuming consistency wins here, and 'this.' stays, but wanted to double check. Doesn't really matter IMO. I just hit Alt-Enter when I have it in focus, otherwise I ignore that. 2) Using different conventions for fields and parameters\local vars. blah vs. _blah Combined with 1, Re# wants (and I'm personally accustomed to): _blah = blah; However, that seems to violate the adopted style. I think we should stick to the Java naming conventions in the private parts (minus the function casings) as much as possible. The main reason is the ability to apply patches from Java Lucene and support future ports more easily. This is why I kept variable names untouched.
3) Full qualification of type names. Re# wants to remove redundant namespace qualifiers. Leave them or remove them? Same Alt-Enter argument as above... 4) Removing unreferenced classes. Should I remove non-public unreferenced classes? The ones I've come across so far are private. It's .NET, not C++, but I still usually remove them, not really sure why tho... 5) var vs. explicit I know this has been brought up before, but not sure of the final disposition. FWIW, I prefer var. There are some non-Re# issues I came across as well that look like artifacts of code generation: I move to var because it *might* help in the future when the API changes, and it doesn't really affect anything now 6) Weird param names. Param1 vs. directory I assume it's okay to replace 'Param1' with a descriptive name like 'directory'. Yes. Also var names like out_Renamed to @out. This one is important. 7) Field names that follow local variable naming conventions. Lots of issues related to private vars with names like i, j, k, etc. It feels like the right thing to do is to change the scope so that they go back to being local vars instead of fields. However, this requires a much more significant refactoring, and I didn't want to assume it was okay to do that. See above, I don't think we should touch those. If these questions have already been answered elsewhere and I missed the documentation/FAQ/developer guide, then I apologize and would appreciate the links. Alternatively, if someone has a Re# rule config that they are willing to post somewhere, I would be glad to use it. - Zack On Jul 27, 2012, at 12:00 PM, Itamar Syn-Hershko wrote: The cleanup consists mainly of going file by file with ReSharper and trying to get them as green as possible. Making a lot of fields readonly, removing unused vars and stuff like that. There are still loads of files left. I was also hoping to get to updating the spatial module with some recent updates, and to also support polygon searches. 
But that may take a bit more time, so it's really up to you guys (or we can open a vote for it).
Re: Outstanding issues for 3.0.3
Prescott - we could make an RC and push it to NuGet as a PreRelease, to get real feedback. On Thu, Aug 2, 2012 at 7:13 PM, Prescott Nasser geobmx...@hotmail.com wrote: I don't think we ever fully adopted the style guidelines, probably not a terrible discussion to have. As for this release, I think that by lazy consensus we should branch the trunk at the end of this weekend (say Monday), and begin the process of cutting a release. - my $.02 below 1) Usage of this prefix when not required. this.blah = blah; - required this. this.aBlah = blah; - optional this, which Re# doesn't like. I'm assuming consistency wins here, and 'this.' stays, but wanted to double check. I'd err on the side of consistency 2) Using different conventions for fields and parameters\local vars. blah vs. _blah Combined with 1, Re# wants (and I'm personally accustomed to): _blah = blah; For private variables _ is ok, for anything else, don't use _ as it's not CLR compliant However, that seems to violate the adopted style. 3) Full qualification of type names. Re# wants to remove redundant namespace qualifiers. Leave them or remove them? I try to remove them 4) Removing unreferenced classes. Should I remove non-public unreferenced classes? The ones I've come across so far are private. I'm not sure I understand - are you saying we have classes that are never used in random places? If so, I think before removing them we should have a conversation; what are they, why are they there, etc. - I'm hoping there aren't too many of these.. 5) var vs. explicit I know this has been brought up before, but not sure of the final disposition. FWIW, I prefer var. I use var when it's plainly obvious what the object is: var obj = new MyClass(). 
I usually use explicit when it's an object returned from some function that makes it unclear what the return value is: var items = search.GetResults(); vs IList<SearchResult> items = search.GetResults(); // preferred There are some non-Re# issues I came across as well that look like artifacts of code generation: 6) Weird param names. Param1 vs. directory I assume it's okay to replace 'Param1' with a descriptive name like 'directory'. Weird - I think a rename is OK for this release (since we're ticking up a full version number), but I believe changing param names can potentially break code. That said, I don't really think we need to change the names to push the 3.0.3 release out, and if it does in fact cause breaking changes, I'd be a little careful about how we do it going forward to 3.6. 7) Field names that follow local variable naming conventions. Lots of issues related to private vars with names like i, j, k, etc. It feels like the right thing to do is to change the scope so that they go back to being local vars instead of fields. However, this requires a much more significant refactoring, and I didn't want to assume it was okay to do that. I'd avoid this for now - a lot of this is a carry-over from the Java version, and if we rename all those, it starts to get a bit confusing if we have to compare Java to C# and these are all changed around. If these questions have already been answered elsewhere and I missed the documentation/FAQ/developer guide, then I apologize and would appreciate the links. Alternatively, if someone has a Re# rule config that they are willing to post somewhere, I would be glad to use it. I think we talked about Re#'s rules at one point, I'll try to dig that conversation up and see where it landed. It's probably a good idea for us to build rules though. - Zack On Jul 27, 2012, at 12:00 PM, Itamar Syn-Hershko wrote: The cleanup consists mainly of going file by file with ReSharper and trying to get them as green as possible. 
Making a lot of fields readonly, removing unused vars and stuff like that. There are still loads of files left. I was also hoping to get to updating the spatial module with some recent updates, and to also support polygon searches. But that may take a bit more time, so it's really up to you guys (or we can open a vote for it).
Re: Lucene Nuget
Yes, with the due release. I, for one, always mistake one for the other. On Wed, Aug 1, 2012 at 4:09 AM, Prescott Nasser geobmx...@hotmail.com wrote: There are two Lucene packages on NuGet that are deprecated. With some updates NuGet made a while ago, we have the ability to remove those packages. Do we want to?
Re: Outstanding issues for 3.0.3
+1 from me too, then On Wed, Aug 1, 2012 at 7:42 PM, Prescott Nasser geobmx...@hotmail.com wrote: Spatial could be something cool to look forward to in 3.6 IMO. I'm good with tagging what we have, and I'd like to take a week to allow the community to test the tagged code against their stuff before cutting release binaries. +1 to going now. Date: Wed, 1 Aug 2012 19:31:45 +0300 Subject: Re: Outstanding issues for 3.0.3 From: ita...@code972.com To: lucene-net-dev@lucene.apache.org I agree. What about the spatial stuff? you guys want to wait for it? On Wed, Aug 1, 2012 at 7:19 PM, Christopher Currens currens.ch...@gmail.com wrote: I think that while it would be nice to get it done, it's a fairly large effort, and we might be better off with doing a release. The tests are massively changed between 3.0.3 and 3.6, so I think a lot of it will get cleaned up anyway during the port. Also, a little while back, I did clean up a lot of the test code to use Assert.Throws and to remove unnecessary variables, though that might have only been in catch statements. Either way, I think we just might be ready as it is. I am eager to start working on porting 3.6. Thanks, Christopher On Wed, Aug 1, 2012 at 9:14 AM, Itamar Syn-Hershko ita...@code972.com wrote: I still have plenty to go on, but on second thought we could do that work just the same when we work towards 3.6, so I won't hold you off anymore Up to Chris - he wanted to do some tests cleanup Also, I'll be updating the Spatial contrib during the next week or so with polygon support. I think we should hold off the release so we can provide that as well, but I suggest we take a vote on it, don't let me hold you off. On Wed, Aug 1, 2012 at 6:58 PM, Prescott Nasser geobmx...@hotmail.com wrote: Just wanted to check in - where do we feel like we stand? What is left to do - is there anything I can help with specifically? I'll have some spare cycles this weekend. 
I want to really make a push to get this ready to roll and not let it languish ~P Date: Sat, 28 Jul 2012 20:38:10 +0300 Subject: Re: Outstanding issues for 3.0.3 From: ita...@code972.com To: lucene-net-dev@lucene.apache.org Go ahead with contrib and tests, I'll resume with core and coordinate further later On Jul 27, 2012 7:04 PM, Christopher Currens currens.ch...@gmail.com wrote: I've got ReSharper and can help with that if you'd like to coordinate it. I can take one or some of the contrib projects or part of the main library, or *shudder* any of the test libraries. The code has needed some cleaning up for a while, and some of the clean up work is an optimization at some levels, so I'm definitely okay with spending some time doing that. I'm okay with waiting longer as long as something is getting done. Thanks, Christopher On Fri, Jul 27, 2012 at 9:00 AM, Itamar Syn-Hershko ita...@code972.com wrote: The cleanup consists mainly of going file by file with ReSharper and trying to get them as green as possible. Making a lot of fields readonly, removing unused vars and stuff like that. There are still loads of files left. I was also hoping to get to updating the spatial module with some recent updates, and to also support polygon searches. But that may take a bit more time, so it's really up to you guys (or we can open a vote for it). On Fri, Jul 27, 2012 at 6:35 PM, Christopher Currens currens.ch...@gmail.com wrote: Itamar, Where do we stand on the clean up now? Is there anything in particular that you're doing that you'd like help with? I have some free time today and am eager to get this version released. Thanks, Christopher On Sat, Jul 21, 2012 at 1:02 PM, Prescott Nasser geobmx...@hotmail.com wrote: Alright, I'll hold off a bit. 
Date: Sat, 21 Jul 2012 22:59:32 +0300 Subject: Re: Outstanding issues for 3.0.3 From: ita...@code972.com To: lucene-net-u...@lucene.apache.org CC: lucene-net-dev@lucene.apache.org Actually there was some clean up work I started doing and would want to complete, and also sign off on the suspected corruption issue we
Re: Outstanding issues for 3.0.3
Yes, we could also release a 3.0.10 or something with the improved spatial module. Or I can race Prescott's week and get it in before it ends :) And for heaven's sake, can we move to git when graduating? A live crash course to all committers is on me. On Wed, Aug 1, 2012 at 7:42 PM, Christopher Currens currens.ch...@gmail.com wrote: Ah, I did overlook that. I imagine that the move from 3.0.3 to 3.6 will realistically take a while, so if we can't get spatial stuff out before then, would it take until 3.6 to be able to release new functionality into the spatial contrib project? Along those lines, I propose that we move 3.0.3 into a new branch instead of just tagging the release and merging in 3.6. That way, during the time it takes to port 3.6, we can still do any critical bug fixes and features like these and release new versions. At least then, people won't be waiting for months for bug fixes. If we did that, then it also might not be critical to get the spatial stuff out with this release, since we could get out a new release in a few weeks with updated spatial libraries...not that I'm against waiting for it now. It was just a suggestion on how we can move forward with the project. Thoughts either way on this? On Wed, Aug 1, 2012 at 9:31 AM, Itamar Syn-Hershko ita...@code972.com wrote: I agree What about the spatial stuff? you guys want to wait for it? On Wed, Aug 1, 2012 at 7:19 PM, Christopher Currens currens.ch...@gmail.com wrote: I think that while it would be nice to get it done, it's a fairly large effort, and we might be better off with doing a release. The tests are massively changed between 3.0.3 and 3.6, so I think a lot of it will get cleaned up anyway during the port. Also, a little while back, I did clean up a lot of the test code to use Assert.Throws and to remove unnecessary variables, though that might have only been in catch statements. Either way, I think we just might be ready as it is. I am eager to start working on porting 3.6. 
Re: Outstanding issues for 3.0.3
On that note, see git-flow http://nvie.com/posts/a-successful-git-branching-model/ :) On Wed, Aug 1, 2012 at 7:49 PM, Prescott Nasser geobmx...@hotmail.com wrote: That's probably not a bad idea - we should probably move to a structure like that anyway going forward, so that it's easier to manage bug fixes and minor updates in between the big work Date: Wed, 1 Aug 2012 09:42:40 -0700 Subject: Re: Outstanding issues for 3.0.3 From: currens.ch...@gmail.com To: lucene-net-dev@lucene.apache.org Ah, I did overlook that. I imagine that the move from 3.0.3 to 3.6 will realistically take a while, so if we can't get spatial stuff out before then, would it take until 3.6 to be able to release new functionality into the spatial contrib project? Along those lines, I propose that we move 3.0.3 into a new branch instead of just tagging the release and merging in 3.6. That way, during the time it takes to port 3.6, we can still do any critical bug fixes and features like these and release new versions. At least then, people won't be waiting for months for bug fixes. If we did that, then it also might not be critical to get the spatial stuff out with this release, since we could get out a new release in a few weeks with updated spatial libraries...not that I'm against waiting for it now. It was just a suggestion on how we can move forward with the project. Thoughts either way on this? On Wed, Aug 1, 2012 at 9:31 AM, Itamar Syn-Hershko ita...@code972.com wrote: I agree. What about the spatial stuff? you guys want to wait for it? On Wed, Aug 1, 2012 at 7:19 PM, Christopher Currens currens.ch...@gmail.com wrote: I think that while it would be nice to get it done, it's a fairly large effort, and we might be better off with doing a release. The tests are massively changed between 3.0.3 and 3.6, so I think a lot of it will get cleaned up anyway during the port. 
Re: Outstanding issues for 3.0.3
I agree What about the spatial stuff? you guys want to wait for it? On Wed, Aug 1, 2012 at 7:19 PM, Christopher Currens currens.ch...@gmail.com wrote: I think that while it would be nice to get it done, it's a fairly large effort, and we might be better off with doing a release. The tests are massively changed between 3.0.3 and 3.6, so I think a lot of it will get cleaned up anyway during the port. Also, a little while back, I did clean up a lot of the test code to use Assert.Throws and to remove unnecessary variables, though that might have only been in catch statements. Either way, I think we just might be ready as it is. I am eager to start working on porting 3.6. Thanks, Christopher On Wed, Aug 1, 2012 at 9:14 AM, Itamar Syn-Hershko ita...@code972.com wrote: I still have plenty to go on, but on a second thought we could do that work just the same when we work towards 3.6, so I won't hold you off anymore Up to Chris - he wanted to do some tests cleanup Also, I'll be updating the Spatial contrib during the next week or so with polygon support. I think we should hold off the release so we can provide that as well, but I suggest we will take a vote on it, don't let me hold you off. On Wed, Aug 1, 2012 at 6:58 PM, Prescott Nasser geobmx...@hotmail.com wrote: Just wanted to check in - where do we feel like we stand? What is left to do - is there anything I can help with specifically? I'll have some spare cycles this weekend. I want to really make a push to get this ready to roll and not let it languish ~P Date: Sat, 28 Jul 2012 20:38:10 +0300 Subject: Re: Outstanding issues for 3.0.3 From: ita...@code972.com To: lucene-net-...@lucene.apache.org Go ahead with contrib and tests, ill resume with core and coordinate further later On Jul 27, 2012 7:04 PM, Christopher Currens currens.ch...@gmail.com wrote: I've got resharper and can help with that if you'd like to coordinate it. 
I can take one or some of the contrib projects or part of the main library, or *shudder* any of the test libraries. The code has needed some cleaning up for a while, and some of the clean up work is an optimization at some levels, so I'm definitely okay with spending some time doing that. I'm okay with waiting longer as long as something is getting done. Thanks, Christopher On Fri, Jul 27, 2012 at 9:00 AM, Itamar Syn-Hershko ita...@code972.com wrote: The cleanup consists mainly of going file by file with ReSharper and trying to get them as green as possible. Making a lot of fields readonly, removing unused vars and stuff like that. There are still loads of files left. I was also hoping to get to updating the spatial module with some recent updates, and to also support polygon searches. But that may take a bit more time, so it's really up to you guys (or we can open a vote for it). On Fri, Jul 27, 2012 at 6:35 PM, Christopher Currens currens.ch...@gmail.com wrote: Itamar, Where do we stand on the clean up now? Is there anything in particular that you're doing that you'd like help with? I have some free time today and am eager to get this version released. Thanks, Christopher On Sat, Jul 21, 2012 at 1:02 PM, Prescott Nasser geobmx...@hotmail.com wrote: Alright, I'll hold off a bit. Date: Sat, 21 Jul 2012 22:59:32 +0300 Subject: Re: Outstanding issues for 3.0.3 From: ita...@code972.com To: lucene-net-u...@lucene.apache.org CC: lucene-net-...@lucene.apache.org Actually there was some clean up work I started doing and would want to complete, and also sign off on the suspected corruption issue we raised. I'm afraid I won't have much time this week to properly do all that, but I'll keep you posted. On Sat, Jul 21, 2012 at 10:20 PM, Prescott Nasser geobmx...@hotmail.com wrote: Alright, latest patch fixed what could be done with the CLS issues at present. With that, I think we are ready to roll with a release. 
If people could please take some time to run all the tests, as well as whatever other tests they might run. We've had some issues with test failures that only happen on some systems, so I want to make sure we have those bases covered. Unless there is anything else that should be done, I'll leave everyone a week to run their tests.
Re: Outstanding issues for 3.0.3
Go ahead with contrib and tests, I'll resume with core and coordinate further later. On Jul 27, 2012 7:04 PM, Christopher Currens currens.ch...@gmail.com wrote: I've got ReSharper and can help with that if you'd like to coordinate it. I can take one or some of the contrib projects or part of the main library, or *shudder* any of the test libraries. The code has needed some cleaning up for a while, and some of the clean-up work is an optimization at some levels, so I'm definitely okay with spending some time doing that. I'm okay with waiting longer as long as something is getting done. Thanks, Christopher On Fri, Jul 27, 2012 at 9:00 AM, Itamar Syn-Hershko ita...@code972.com wrote: The cleanup consists mainly of going file by file with ReSharper and trying to get them as green as possible. Making a lot of fields readonly, removing unused vars and stuff like that. There are still loads of files left. I was also hoping to get to updating the spatial module with some recent updates, and to also support polygon searches. But that may take a bit more time, so it's really up to you guys (or we can open a vote for it). On Fri, Jul 27, 2012 at 6:35 PM, Christopher Currens currens.ch...@gmail.com wrote: Itamar, Where do we stand on the clean up now? Is there anything in particular that you're doing that you'd like help with? I have some free time today and am eager to get this version released. Thanks, Christopher On Sat, Jul 21, 2012 at 1:02 PM, Prescott Nasser geobmx...@hotmail.com wrote: Alright, I'll hold off a bit. Date: Sat, 21 Jul 2012 22:59:32 +0300 Subject: Re: Outstanding issues for 3.0.3 From: ita...@code972.com To: lucene-net-u...@lucene.apache.org CC: lucene-net-dev@lucene.apache.org Actually there was some clean-up work I started doing and would want to complete, and also sign off on the suspected corruption issue we raised. I'm afraid I won't have much time this week to properly do all that, but I'll keep you posted.
On Sat, Jul 21, 2012 at 10:20 PM, Prescott Nasser geobmx...@hotmail.com wrote: Alright, the latest patch fixed what could be done with the CLS issues at present. With that, I think we are ready to roll with a release. If people could please take some time to run all the tests, as well as whatever other tests they might run. We've had some issues with test failures that only happen on some systems, so I want to make sure we have those bases covered. Unless there is anything else that should be done, I'll leave everyone a week to run their tests. Next Saturday I will tag the trunk and cut a release with both 3.5 and 4.0 binaries. Great work everyone. ~P Date: Mon, 9 Jul 2012 18:02:30 -0700 Subject: Re: Outstanding issues for 3.0.3 From: currens.ch...@gmail.com To: lucene-net-dev@lucene.apache.org I can set a different build target, but I can't set the actual framework to 3.5 without doing it for all build configurations. On top of that, 3.5 needs System.Core to be referenced, which is done automatically in .NET 4 (I'm not sure if MSBuild v4 does it automatically?). I did kinda get it working by putting a TargetFrameworkVersion tag of 4.0 in Debug and Release configurations and 3.5 in Debug 3.5 and Release 3.5 configurations, but that's a little...well, difficult to maintain by hand, since Visual Studio doesn't allow you to set different framework versions per configuration, and Visual Studio seemed to be having trouble with references, since both frameworks were being referenced. On Mon, Jul 9, 2012 at 5:57 PM, Prescott Nasser geobmx...@hotmail.com wrote: What do you mean doesn't work at the project level? I created a different build target NET35 and then we had Debug and Release still, that seemed to work for me. But I feel like I'm missing something in your explanation. Good work though!
Date: Mon, 9 Jul 2012 17:51:36 -0700 Subject: Re: Outstanding issues for 3.0.3 From: currens.ch...@gmail.com To: lucene-net-dev@lucene.apache.org I've got it working, compiling and all tests passing... The only caveat is that I'm not sure the best way to multi-target. It doesn't really work at the project level, so you'd have to create two separate projects, one for .NET 4 and the other for 3.5. To aid me, I wrote a small tool that creates copies of all of the 4.0 projects and solutions to work against
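For readers hitting the same multi-targeting wall, the per-configuration hack Christopher describes amounts to roughly the following hand-edited MSBuild fragment (a sketch only: the property names are real MSBuild, the configuration names are the ones mentioned in the thread, and Visual Studio's UI won't maintain this for you):

```xml
<!-- Sketch of per-configuration framework targeting in a pre-SDK-style
     .csproj. Visual Studio does not expose this in the UI, so it has to
     be maintained by hand, as noted in the thread. -->
<PropertyGroup Condition=" '$(Configuration)' == 'Debug' Or '$(Configuration)' == 'Release' ">
  <TargetFrameworkVersion>v4.0</TargetFrameworkVersion>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)' == 'Debug 3.5' Or '$(Configuration)' == 'Release 3.5' ">
  <TargetFrameworkVersion>v3.5</TargetFrameworkVersion>
</PropertyGroup>
<!-- .NET 3.5 needs System.Core referenced explicitly; .NET 4 adds it automatically. -->
<ItemGroup Condition=" '$(TargetFrameworkVersion)' == 'v3.5' ">
  <Reference Include="System.Core" />
</ItemGroup>
```

As the thread notes, this is fragile precisely because the IDE assumes one framework per project, which is why the eventual workaround was a tool that clones the projects instead.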
Re: Outstanding issues for 3.0.3
in the community. I would love to see that make it into 3.0.3, and would be able to pick up where anyone had left off or take part of it, if they don't have time to work on it. In regards to LUCENENET-446, I agree that it is pretty much complete. I think I've looked several times at it to confirm most/all methods have been converted, so this week I'll do a final check and close it out. Thanks, Christopher On Sun, Jul 8, 2012 at 12:28 PM, Simon Svensson si...@devhost.se wrote: The tests that failed when using culture=sv-se seem fixed. On 2012-07-08 20:44, Itamar Syn-Hershko wrote: What's the status on the failing tests we had? On Sun, Jul 8, 2012 at 9:02 PM, Prescott Nasser geobmx...@hotmail.com wrote: Three issues left that I see: Fixing the build output - I did some work, but I'm good on this; we can move the rest of the work to 3.6: https://issues.apache.org/jira/browse/LUCENENET-456 CLS Compliance: https://issues.apache.org/jira/browse/LUCENENET-446. Are we ok with this as-is for now? There are still a good number of issues left; some we can't really fix (sbyte and volatile are out of scope imo). In a similar vein, our own code uses some obsolete methods and we have a lot of 'variable declared but never used' warnings (mentally, I treat most warnings as errors). GetX/SetX: https://issues.apache.org/jira/browse/LUCENENET-470. I think much of this has been removed; there are probably some pieces left (and we have a difference of opinion in the group as well). I really think the only outstanding issue is the CLS compliance one, the rest can be moved to 3.6. With CLS compliance we have to ask if we've done enough for that so far, or if more is needed. I personally would like to see us make any API changes now, with the 3.0.3 release, but if we are comfortable with it, let's roll. What are your thoughts?
~P From: thowar...@gmail.com Date: Mon, 25 Jun 2012 10:34:37 -0700 Subject: Re: Outstanding issues for 3.0.3 To: lucene-net-dev@lucene.apache.org Assuming we're talking about the packaging/filesystem structure in the releases, the structure is a little of both (ours vs Apache's)... Basically, I went through most of the Apache projects to see how they packaged releases and developed a structure that was very similar but encompassed everything we needed. So, it's informed by the organically emergent structures that ASF uses. -T On Mon, Jun 25, 2012 at 7:32 AM, Prescott Nasser geobmx...@hotmail.com wrote: I have no idea why I thought we were using NAnt. I think it's just our release structure. I figured a little out this weekend, splitting the XML and .dll files into separate directories. The documentation you have on the wiki was actually pretty helpful. Whatever more you can add would be great ~P Date: Mon, 25 Jun 2012 10:04:21 -0400 Subject: Re: Outstanding issues for 3.0.3 From: mhern...@wickedsoftware.net To: lucene-net-dev@lucene.apache.org On Sat, Jun 23, 2012 at 1:38 AM, Prescott Nasser geobmx...@hotmail.com wrote: -- Task 470, a non-serious one, is listed only because it's mostly done and just needs a few loose ends tied up. I'll hopefully have time to take care of that this weekend. How many GetX/SetX are left? I did a quick search for 'public * Get*()' Most
Re: Outstanding issues for 3.0.3
What's the status on the failing tests we had? On Sun, Jul 8, 2012 at 9:02 PM, Prescott Nasser geobmx...@hotmail.com wrote: Three issues left that I see: Fixing the build output - I did some work, but I'm good on this; we can move the rest of the work to 3.6: https://issues.apache.org/jira/browse/LUCENENET-456 CLS Compliance: https://issues.apache.org/jira/browse/LUCENENET-446. Are we ok with this as-is for now? There are still a good number of issues left; some we can't really fix (sbyte and volatile are out of scope imo). In a similar vein, our own code uses some obsolete methods and we have a lot of 'variable declared but never used' warnings (mentally, I treat most warnings as errors). GetX/SetX: https://issues.apache.org/jira/browse/LUCENENET-470. I think much of this has been removed; there are probably some pieces left (and we have a difference of opinion in the group as well). I really think the only outstanding issue is the CLS compliance one, the rest can be moved to 3.6. With CLS compliance we have to ask if we've done enough for that so far, or if more is needed. I personally would like to see us make any API changes now, with the 3.0.3 release, but if we are comfortable with it, let's roll. What are your thoughts? ~P From: thowar...@gmail.com Date: Mon, 25 Jun 2012 10:34:37 -0700 Subject: Re: Outstanding issues for 3.0.3 To: lucene-net-dev@lucene.apache.org Assuming we're talking about the packaging/filesystem structure in the releases, the structure is a little of both (ours vs Apache's)... Basically, I went through most of the Apache projects to see how they packaged releases and developed a structure that was very similar but encompassed everything we needed. So, it's informed by the organically emergent structures that ASF uses. -T On Mon, Jun 25, 2012 at 7:32 AM, Prescott Nasser geobmx...@hotmail.com wrote: I have no idea why I thought we were using NAnt. I think it's just our release structure.
I figured a little out this weekend, splitting the XML and .dll files into separate directories. The documentation you have on the wiki was actually pretty helpful. Whatever more you can add would be great ~P Date: Mon, 25 Jun 2012 10:04:21 -0400 Subject: Re: Outstanding issues for 3.0.3 From: mhern...@wickedsoftware.net To: lucene-net-dev@lucene.apache.org On Sat, Jun 23, 2012 at 1:38 AM, Prescott Nasser geobmx...@hotmail.com wrote: -- Task 470, a non-serious one, is listed only because it's mostly done and just needs a few loose ends tied up. I'll hopefully have time to take care of that this weekend. How many GetX/SetX are left? I did a quick search for 'public * Get*()' Most of them looked to be actual methods - perhaps a few to replace -- Task 446 (CLS Compliance) is important, but there's no way we can get this done quickly. The current state of this issue is that all of the names of public members are now compliant. There are a few things that aren't: the use of sbyte (particularly those related to the FieldCache) and some conflicts with *protected or internal* fields (some with public members). Opinions on this one will be appreciated the most. My opinion is that we should draw a line on the amount of CLS compliance to have in this release, and push the rest into 3.5. I count roughly 53 CLS compliance issues. The sbyte stuff will run into trouble when you do bit shifting (I ran into this issue when trying to do this for 2.9.4). I'd like to see if we can't get rid of the easier stuff (internal/protected stuff). I would not try getting rid of sbyte or volatile for this release. It's going to take some serious consideration to get rid of those -- Improvement 337 - Are we going to add this code (not present in java) to the core library? I'd skip it and re-evaluate the community desire for this in 3.5. -- Improvement 456 - This is related to builds being output in Apache's release format. Do we want to do this for this release?
I looked into this last weekend - I'm terrible with NAnt, so I didn't get anywhere. It would be nice to have, but I don't think I'll figure it out. If Michael has some time to maybe make the adjustment, he knows these scripts best. If not I'm going to look into it, but I don't call this a show stopper - either we have it or we don't when the rest is done. With some Flo Rida and espresso shots, anything is possible. Did we switch to NAnt? I saw the jira ticket for this. Is there an official Apache release structure, or is this just *our* Apache release structure that we are using? Can I take the latest release and use that to model the structure you
Re: [VOTE] Apache Lucene.Net ready for graduation?
+1 for graduation. I still think graduation should be in sync with the 3.0.3 release and a press release on work towards 3.6 and 4.0 releases. On Sun, Jul 8, 2012 at 8:44 PM, Prescott Nasser geobmx...@hotmail.com wrote: Hey All, This is the first step for graduation for the Apache Lucene.Net project (incubating of course..). We're taking a vote for the Lucene.Net community to see if the community is ready to govern itself as a top level project. Here is a short list of our accomplishments which I believe make us ready for graduation: - Released 2.9.4 - Released 2.9.4g (Generics version) - Created a new website, with a new logo (a 99designs contest graciously supported by stackoverflow) - Added two new committers bringing our total to 9. - Preparing for 3.0.3 release within the next couple of weeks - Started work on 3.5 release. This is the process we will follow: - Community vote (this email). All votes count, there is no non-binding / binding status for this - We will propose a resolution for review ( https://cwiki.apache.org/confluence/display/LUCENENET/Graduation+-+Resolution+Template ) - We will call a vote on the resolution in general @ incubator - A Board resolution will be submitted. As a community, if you would please vote: [1] Ready for graduation [-1] Not ready because... I know I speak for all the developers on this project when I say we appreciate (and will continue to appreciate) everyone's contributions via the mailing list and jira. ~Prescott
Re: svn commit: r1353075 - /incubator/lucene.net/branches/Lucene.Net_3_5/
Why 3.5 and not 3.6? In my opinion we should skip all versions in between 3.0.3 and 3.6, and just port 3.6 after we released 3.0.3. Lucene 4 will probably be released by the time we are done, and then we could move on to porting it. On Sat, Jun 23, 2012 at 9:35 AM, pnas...@apache.org wrote: Author: pnasser Date: Sat Jun 23 06:35:44 2012 New Revision: 1353075 URL: http://svn.apache.org/viewvc?rev=1353075view=rev Log: Branching for 3.5 Added: incubator/lucene.net/branches/Lucene.Net_3_5/ (props changed) - copied from r1353074, incubator/lucene.net/trunk/ Propchange: incubator/lucene.net/branches/Lucene.Net_3_5/ -- --- svn:mergeinfo (added) +++ svn:mergeinfo Sat Jun 23 06:35:44 2012 @@ -0,0 +1,2 @@ +/incubator/lucene.net/branches/Lucene.Net.3.0.3/trunk:1199075-1294851* +/incubator/lucene.net/trunk:1199072-1294798*
Re: Endian types
To add to this - Lucene 4x is still being worked on in the Java front. We'd rather put effort into porting v3.6 and start on v4 once there is an official Java release. Thanks for your efforts! On Wed, Jun 20, 2012 at 6:19 PM, Prescott Nasser geobmx...@hotmail.com wrote: How much are you trying to port? I've got it on my roadmap to work with Sharpen to try and get most of it auto-ported. Any porting help is of course appreciated and welcome - but if you have some time and are so inclined, we could use more people helping on the Sharpen front. From: Oren Eini (Ayende Rahien) Sent: 6/20/2012 7:52 AM To: lucene-net-...@lucene.apache.org Subject: Re: Endian types I would assume that you would have to match the Java behavior, if only to make sure that the index format matched. On Wed, Jun 20, 2012 at 5:47 PM, Kim Christensen k...@dubex.dk wrote: Hi all, I was looking into porting some Lucene 4x code, and ran into the issue about Big-Endian and Little-Endian. What is the standpoint on this? Always Big-Endian as Java does it? Regards, Kim
[jira] [Commented] (LUCENENET-495) Use of DateTime.Now causes huge amount of System.Globalization.DaylightTime object allocations
[ https://issues.apache.org/jira/browse/LUCENENET-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13393629#comment-13393629 ] Itamar Syn-Hershko commented on LUCENENET-495: -- 1. IMO, if there is a thread safety bug, it needs to be fixed 2. Why do we have AddIfNotContains(Hashtable, object), and why are we not using ConcurrentDictionary? Use of DateTime.Now causes huge amount of System.Globalization.DaylightTime object allocations -- Key: LUCENENET-495 URL: https://issues.apache.org/jira/browse/LUCENENET-495 Project: Lucene.Net Issue Type: Bug Components: Lucene.Net Core Affects Versions: Lucene.Net 2.9.4, Lucene.Net 3.0.3 Reporter: Christopher Currens Assignee: Christopher Currens Priority: Critical Fix For: Lucene.Net 3.0.3 This issue mostly just affects RAMDirectory. However, RAMFile and RAMOutputStream are used in other (all?) directory implementations, including FSDirectory types. In RAMOutputStream, the file last modified property for the RAMFile is updated when the stream is flushed. It's calculated using {{DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond}}. I've read before that Microsoft has regretted making DateTime.Now a property instead of a method, and after seeing what it's doing, I'm starting to understand why. DateTime.Now returns local time. In order for it to calculate that, it has to get the UTC offset for the machine, which requires the creation of a _class_, System.Globalization.DaylightTime. This is bad for performance. Using code to write 10,000 small documents to an index (4KB sizes), it created 1,570,157 of these DaylightTime classes, a total of 62MB of extra memory... clearly RAMOutputStream.Flush() is called a lot. A fix I'd like to propose is to change the RAMFile to store the LastModified date as UTC instead of local. DateTime.UtcNow doesn't create any additional objects and is very fast. For this small benchmark, the performance increase is 31%.
I've set it to convert to local time when {{RAMDirectory.LastModified(string name)}} is called, to make sure it has the same behavior (tests fail otherwise). Are there any other side-effects to making this change? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
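The proposed fix - store UTC internally, convert to local time only when the last-modified value is read back - can be sketched roughly like this (illustrative C# only; the class and member names here are hypothetical, not the actual RAMFile patch):

```csharp
using System;

// Minimal sketch of the idea in LUCENENET-495: DateTime.UtcNow avoids the
// DaylightTime allocations that DateTime.Now incurs, so record UTC ticks on
// every flush and convert to local time only at the rarely-called read side.
// "FileStamp" and its members are illustrative names, not Lucene.Net code.
class FileStamp
{
    private long lastModifiedUtcTicks;

    // Called on every flush - must be cheap, so use UtcNow.
    public void Touch()
    {
        lastModifiedUtcTicks = DateTime.UtcNow.Ticks;
    }

    // Called rarely (e.g. from Directory.LastModified) - convert to local
    // time here to preserve the old DateTime.Now-based behavior.
    public long LastModifiedLocalMillis
    {
        get
        {
            var utc = new DateTime(lastModifiedUtcTicks, DateTimeKind.Utc);
            return utc.ToLocalTime().Ticks / TimeSpan.TicksPerMillisecond;
        }
    }
}
```

The design point is that the hot path (flush) does no allocation, while the conversion cost is paid only where the old local-time semantics are actually observed.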
[jira] [Commented] (LUCENENET-495) Use of DateTime.Now causes huge amount of System.Globalization.DaylightTime object allocations
[ https://issues.apache.org/jira/browse/LUCENENET-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13393633#comment-13393633 ] Itamar Syn-Hershko commented on LUCENENET-495: -- Makes sense
[jira] [Commented] (LUCENENET-495) Use of DateTime.Now causes huge amount of System.Globalization.DaylightTime object allocations
[ https://issues.apache.org/jira/browse/LUCENENET-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13393363#comment-13393363 ] Itamar Syn-Hershko commented on LUCENENET-495: -- +1 Take a look at DateTimeOffset as well - this becomes the standard for .NET 4+
Re: Lets talk graduation
+1 for releasing after graduation, then. With some careful PR and our sponsorship offer, we can get the project flying. There's still some work to do anyway. On Fri, Jun 15, 2012 at 1:59 PM, Stefan Bodewig bode...@apache.org wrote: On 2012-06-14, Christopher Currens wrote: I've gone back and forth on whether I think we're ready for graduation or not. I had always felt like we weren't because the project isn't as active as I'd like it to be. However, I think I've been looking at it wrong. We've got a good enough process and we *have* made progress. Absolutely, and I think you are ready to graduate as well. As a response to Itamar: Lucene.Net could get more exposure by becoming a top level project. In particular you could craft a press release together with the ASF's PR folks to celebrate the re-birth. The sponsoring offer is a great thing, IMHO. I'm up for starting this process, but I don't want it to take any time away from getting 3.0.3 released. Understood. OTOH if you graduate first then 3.0.3 would be an official Apache release and wouldn't have to wear the incubating tag. Your call. If you want to do the 3.0.3 release first, I don't think that will be much of a delay, as it seems to be around the corner anyway. Stefan
Releasing 3.0.3
Where do we stand with this? I want to push to a 3.0.3 release, what items are still pending? Itamar.
Re: Lets talk graduation
IMHO, whatever brings more attention to the project, and I'm not sure graduation is what this project needs right now. In the end it's just semantics. I'd focus those efforts on getting more work done and having more frequent releases. Hence our proposition to sponsor dev, which still stands. On Thu, Jun 14, 2012 at 6:24 PM, Prescott Nasser geobmx...@hotmail.com wrote: I think with the addition of two new committers we've made some progress in community growth. I think we'll have 3.0.3 out the door soon - are there any other items we think we need to address before looking to graduate? ~P
Re: Releasing 3.0.3
Ok, and is the code in 100% compliance with the 3.0.3 Java code? I'll be spending some time on fixing the index corruption issue, and it is probably best for Chris to wrap up the work he has started. Anyone else on board to close some tickets? On Thu, Jun 14, 2012 at 6:19 PM, Prescott Nasser geobmx...@hotmail.com wrote: Agreed - JIRA for 3.0.3: https://issues.apache.org/jira/browse/LUCENENET/fixforversion/12316215#selectedTab=com.atlassian.jira.plugin.system.project%3Aversion-issues-panel We should evaluate all of these - fix them, mark as won't fix, or move them to another release version. I think the biggest hold-up currently is https://issues.apache.org/jira/browse/LUCENENET-484. Chris has made a huge dent, but there are two test cases that are still listed as failing (I can't even duplicate those failures to know where to start). Also, we should look at all the other jira tickets and make updates where appropriate ~P Date: Thu, 14 Jun 2012 13:21:04 +0300 Subject: Releasing 3.0.3 From: ita...@code972.com To: lucene-net-dev@lucene.apache.org Where do we stand with this? I want to push to a 3.0.3 release, what items are still pending? Itamar.
Re: Corrupt index
I'm quite certain this shouldn't happen even when Commit wasn't called. Mike, can you comment on that? On Thu, Jun 14, 2012 at 8:03 PM, Christopher Currens currens.ch...@gmail.com wrote: Well, the only thing I see is that there is no place where writer.Commit() is called in the delegate assigned to corpusReader.OnDocument. I know that lucene is very transactional, and at least in 3.x, the writer will never auto-commit to the index. You can write millions of documents, but if commit is never called, those documents aren't actually part of the index. Committing isn't a cheap operation, so you definitely don't want to do it on every document. You can test it yourself with this (naive) solution. Right below the writer.SetUseCompoundFile(false) line, add int numDocsAdded = 0;. At the end of the corpusReader.OnDocument delegate add: // Example only. I wouldn't suggest committing this often if(++numDocsAdded % 5 == 0) { writer.Commit(); } I had the application crash for real on this file: http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2 , about 20% into the operation. Without the commit, the index is empty. Add it in, and I get 755 files in the index after it crashes. Thanks, Christopher On Wed, Jun 13, 2012 at 6:13 PM, Itamar Syn-Hershko ita...@code972.com wrote: Yes, reproduced on the first try. See attached program - I referenced it to current trunk. On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko ita...@code972.com wrote: Christopher, I used the IndexBuilder app from here https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with an 8.5GB Wikipedia dump. After running for 2.5 days I had to forcefully close it (infinite loop in the wiki-markdown parser at 92%, go figure), and the 40-something GB index I had by then was unusable. I then was able to reproduce this. Please note I now added a few safeguards you might want to remove to make sure the app really crashes on process kill.
I'll try to come up with a better way to reproduce this - hopefully Mike will be able to suggest better ways than a manual process kill... On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens currens.ch...@gmail.com wrote: Mike, The codebase for lucene.net should be almost identical to java's 3.0.3 release, and LUCENE-1044 is included in that. Itamar, are you committing the index regularly? I only ask because I can't reproduce it myself by forcibly terminating the process while it's indexing. I've tried both 3.0.3 and 2.9.4. If I don't commit at all and terminate the process (even with 10,000 4K documents created), there will be no documents in the index when I open it in Luke, which I expect. If I commit at 10,000 documents, and terminate it a few thousand documents after that, the index has the first ten thousand that were committed. I've even terminated it *while* a second commit was taking place, and it still had all of the documents I expected. It may be that I'm not trying to reproduce it correctly. Do you have a minimal amount of code that can reproduce it? Thanks, Christopher On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless luc...@mikemccandless.com wrote: Hi Itamar, One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss. More responses below: On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko ita...@code972.com wrote: I'm a Lucene.Net committer, and there is a chance we have a bug in our FSDirectory implementation that causes indexes to get corrupted when indexing is cut while the IW is still open. As it stems from some retroactive fixes you made, I'd appreciate your feedback. Correct me if I'm wrong, but by design Lucene should be able to recover rather quickly from power failures or app crashes. Since existing segment files are read-only, only new segments that are still being written can get corrupted.
Hence, recovering from worst-case scenarios is done by simply removing the write.lock file. The worst that could happen then is having the last segment damaged, and that can be fixed by removing those files, possibly by running CheckIndex on the index. You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and then removing old segments_N files (and any segments that are no longer referenced). You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now). Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able
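The commit semantics discussed in this thread - nothing added since the last Commit() survives a crash - suggest a periodic-commit pattern like the following (a sketch against the 3.0.3-era Lucene.Net API as I understand it, not code from the thread; the batch size is an arbitrary example value):

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch: documents added between commits are lost if the process dies,
// so commit every N documents to bound the amount of work at risk.
class PeriodicCommitExample
{
    static void IndexAll(IEnumerable<Document> docs, string indexPath)
    {
        var dir = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath));
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            int added = 0;
            foreach (var doc in docs)
            {
                writer.AddDocument(doc);
                // Without this, a crash discards everything since the last
                // commit; committing too often is expensive. 5000 is just an
                // example trade-off, not a recommended value.
                if (++added % 5000 == 0)
                    writer.Commit();
            }
            writer.Commit(); // flush the final partial batch
        }
        finally
        {
            writer.Dispose(); // release the write.lock
        }
    }
}
```

This is the same trade-off Christopher describes: commit frequency bounds data loss on a crash at the price of indexing throughput.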
Re: Releasing 3.0.3
Sorry, misread your question. This can be easily done with xUnit, using Theories. On Thu, Jun 14, 2012 at 9:26 PM, Itamar Syn-Hershko ita...@code972.com wrote: Something like: Thread.CurrentThread.CurrentCulture = cultureInfo; Thread.CurrentThread.CurrentUICulture = cultureInfo; And setting it back later when the test is done. You can easily do this with an IDisposable like this: using(new TemporaryCulture(culture)){ ... } On Thu, Jun 14, 2012 at 9:10 PM, Simon Svensson si...@devhost.se wrote: I've been thinking about LUCENENET-493 (Make Lucene.Net culture insensitive). It's easy to fix the code, and verify it on my machine (running CurrentCulture=sv-SE), but there are no tests to confirm the changes. I've been looking for ways to build test cases for different cultures, like the overridden runBare method used originally in the java code, but NUnit does not seem to have any such abilities within the tests themselves. 1) It is possible to build NUnit addins that could execute every test [with special annotation?] once for every culture. ReSharper supports NUnit addins, provided they are manually placed in the correct folder under the ReSharper application folder. 2) We could rewrite culture-sensitive tests into a method that holds the logic, and several test methods with [SetCulture(...)], but this requires knowledge about which tests are culture-sensitive. We could also rewrite every method into a foreach-loop, executing the test logic with every culture. 3) Change unit testing framework. Any thoughts? On 2012-06-14 17:58, Prescott Nasser wrote: I'm going to try and review some of them - looking at the 3.5 ticket atm. The code should be in compliance with 3.0.3. We might want to do some spot checking of various parts of the code. I'm not sure about the tests. Also, we should probably run some code coverage tools to see how much coverage we have.
~P Date: Thu, 14 Jun 2012 18:37:12 +0300 Subject: Re: Releasing 3.0.3 From: ita...@code972.com To: lucene-net-dev@lucene.apache.org Ok, and is the code in 100% compliance with the 3.0.3 Java code? I'll be spending some time on fixing the index corruption issue, and it is probably best for Chris to wrap up the work he has started. Anyone else on board to close some tickets? On Thu, Jun 14, 2012 at 6:19 PM, Prescott Nasser geobmx...@hotmail.com wrote: Agreed - JIRA for 3.0.3: https://issues.apache.org/jira/browse/LUCENENET/fixforversion/12316215#selectedTab=com.atlassian.jira.plugin.system.project%3Aversion-issues-panel We should evaluate all of these - fix them, mark as won't fix, or move them to another release version. I think the biggest hold up currently is: https://issues.apache.org/jira/browse/LUCENENET-484. Chris has made a huge dent, but there are two test cases that are still listed as failing (I can't even duplicate those failures to know where to start). Also we should look at all the other jira tickets and make updates where appropriate ~P Date: Thu, 14 Jun 2012 13:21:04 +0300 Subject: Releasing 3.0.3 From: ita...@code972.com To: lucene-net-dev@lucene.apache.org Where do we stand with this? I want to push to a 3.0.3 release, what items are still pending? Itamar.
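The set-and-restore pattern sketched above with an IDisposable can be illustrated outside .NET as well. Below is a minimal Java analogue, assuming nothing beyond the JDK; the class name TemporaryLocale is my own, mirroring the hypothetical TemporaryCulture from the thread:

```java
import java.util.Locale;

// Hypothetical helper (not part of Lucene.Net or NUnit): the Java analogue of the
// IDisposable TemporaryCulture idea. Swaps the default Locale and restores it when
// the try-with-resources block exits, even if the test body throws.
public class TemporaryLocale implements AutoCloseable {
    private final Locale previous;

    public TemporaryLocale(Locale locale) {
        this.previous = Locale.getDefault();
        Locale.setDefault(locale);
    }

    @Override
    public void close() {
        Locale.setDefault(previous);
    }

    public static void main(String[] args) {
        Locale original = Locale.getDefault();
        try (TemporaryLocale tl = new TemporaryLocale(new Locale("sv", "SE"))) {
            // Culture-sensitive test logic would run here under sv-SE.
            System.out.println(Locale.getDefault());
        }
        // The previous default is restored on exit.
        System.out.println(Locale.getDefault().equals(original));
    }
}
```

A test body wrapped this way cannot leak a changed culture into later tests, which is the whole point of the IDisposable variant suggested above.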
Re: Corrupt index
I can confirm 2.9.4 had autoCommit, but it is gone in 3.0.3 already, so Lucene.Net doesn't have autoCommit. So I don't have autoCommit set to true, but I can clearly see a segments_1 file there along with the other files. If that helps, it always keeps the name segments_1 with 32 bytes, never changes. And again, if I kill the process and try to open the index with Luke 3.3, the index folder is being wiped out. Not sure what to make of all that. On Fri, Jun 15, 2012 at 3:21 AM, Michael McCandless luc...@mikemccandless.com wrote: Hmm, OK: in 2.9.4 / 3.0.x, if you open IW on a new directory, it will make a zero-segment commit. This was changed/fixed in 3.1 with LUCENE-2386. In 2.9.x (not 3.0.x) there is still an autoCommit parameter, defaulting to false, but if you set it to true then IndexWriter will periodically commit. Seeing segment files created and merged is definitely expected, but it's not expected to see segments_N files unless you pass autoCommit=true. Mike McCandless http://blog.mikemccandless.com On Thu, Jun 14, 2012 at 8:14 PM, Itamar Syn-Hershko ita...@code972.com wrote: Not what I'm seeing. I actually see a lot of segments created and merged while it operates. Expected? Reminding you, this is 2.9.4 / 3.0.3 On Fri, Jun 15, 2012 at 3:10 AM, Michael McCandless luc...@mikemccandless.com wrote: Right: Lucene never autocommits anymore ... If you create a new index, add a bunch of docs, and things crash before you have a chance to commit, then there is no index (not even a 0 doc one) in that directory. Mike McCandless http://blog.mikemccandless.com On Thu, Jun 14, 2012 at 1:41 PM, Itamar Syn-Hershko ita...@code972.com wrote: I'm quite certain this shouldn't happen even when Commit wasn't called. Mike, can you comment on that? On Thu, Jun 14, 2012 at 8:03 PM, Christopher Currens currens.ch...@gmail.com wrote: Well, the only thing I see is that there is no place where writer.Commit() is called in the delegate assigned to corpusReader.OnDocument. 
I know that lucene is very transactional, and at least in 3.x, the writer will never auto commit to the index. You can write millions of documents, but if commit is never called, those documents aren't actually part of the index. Committing isn't a cheap operation, so you definitely don't want to do it on every document. You can test it yourself with this (naive) solution. Right below the writer.SetUseCompoundFile(false) line, add int numDocsAdded = 0;. At the end of the corpusReader.OnDocument delegate add: // Example only. I wouldn't suggest committing this often if(++numDocsAdded % 5 == 0) { writer.Commit(); } I had the application crash for real on this file: http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2 , about 20% into the operation. Without the commit, the index is empty. Add it in, and I get 755 files in the index after it crashes. Thanks, Christopher On Wed, Jun 13, 2012 at 6:13 PM, Itamar Syn-Hershko ita...@code972.com wrote: Yes, reproduced on first try. See attached program - I referenced it to current trunk. On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko ita...@code972.com wrote: Christopher, I used the IndexBuilder app from here https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with an 8.5GB wikipedia dump. After running for 2.5 days I had to forcefully close it (infinite loop in the wiki-markdown parser at 92%, go figure), and the 40-something GB index I had by then was unusable. I then was able to reproduce this. Please note I now added a few safe-guards you might want to remove to make sure the app really crashes on process kill. I'll try to come up with a better way to reproduce this - hopefully Mike will be able to suggest better ways than manual process kill... On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens currens.ch...@gmail.com wrote: Mike, The codebase for lucene.net should be almost identical to java's 3.0.3 release, and LUCENE-1044 is included in that. 
Itamar, are you committing the index regularly? I only ask because I can't reproduce it myself by forcibly terminating the process while it's indexing. I've tried both 3.0.3 and 2.9.4. If I don't commit at all and terminate the process (even with 10,000 4K documents created), there will be no documents in the index when I open it in luke, which I expect. If I commit at 10,000 documents, and terminate it a few thousand after that, the index has the first ten thousand that were committed. I've even terminated it *while* a second commit was taking
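The commit cadence Christopher suggests boils down to a counter modulo a batch size. Here is a minimal standalone sketch of just that policy; the Lucene IndexWriter is replaced by a plain Runnable callback (my own stand-in, so the logic runs without Lucene on the classpath):

```java
// A sketch of the periodic-commit policy from the snippet quoted above.
// The real code calls writer.Commit() on a Lucene IndexWriter; here "commit" is
// a hypothetical callback so the batching logic can be shown in isolation.
public class PeriodicCommitter {
    private final int batchSize;
    private final Runnable commit;   // stands in for writer.Commit()
    private int numDocsAdded = 0;

    public PeriodicCommitter(int batchSize, Runnable commit) {
        this.batchSize = batchSize;
        this.commit = commit;
    }

    public void docAdded() {
        // Equivalent to: if (++numDocsAdded % 5 == 0) { writer.Commit(); }
        if (++numDocsAdded % batchSize == 0) {
            commit.run();
        }
    }

    public static void main(String[] args) {
        int[] commits = {0};
        PeriodicCommitter pc = new PeriodicCommitter(5, () -> commits[0]++);
        for (int i = 0; i < 17; i++) {
            pc.docAdded();   // e.g. invoked from the OnDocument handler
        }
        // 17 docs with a batch size of 5 -> commits after docs 5, 10 and 15.
        System.out.println(commits[0]);
    }
}
```

In practice the batch size would be far larger than 5 (the thread itself flags 5 as far too frequent), since each commit fsyncs files and is therefore expensive.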
Re: Corrupt index
Christopher, I used the IndexBuilder app from here https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with an 8.5GB wikipedia dump. After running for 2.5 days I had to forcefully close it (infinite loop in the wiki-markdown parser at 92%, go figure), and the 40-something GB index I had by then was unusable. I then was able to reproduce this. Please note I now added a few safe-guards you might want to remove to make sure the app really crashes on process kill. I'll try to come up with a better way to reproduce this - hopefully Mike will be able to suggest better ways than manual process kill... On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens currens.ch...@gmail.com wrote: Mike, The codebase for lucene.net should be almost identical to java's 3.0.3 release, and LUCENE-1044 is included in that. Itamar, are you committing the index regularly? I only ask because I can't reproduce it myself by forcibly terminating the process while it's indexing. I've tried both 3.0.3 and 2.9.4. If I don't commit at all and terminate the process (even with 10,000 4K documents created), there will be no documents in the index when I open it in luke, which I expect. If I commit at 10,000 documents, and terminate it a few thousand after that, the index has the first ten thousand that were committed. I've even terminated it *while* a second commit was taking place, and it still had all of the documents I expected. It may be that I'm not trying to reproduce it correctly. Do you have a minimal amount of code that can reproduce it? Thanks, Christopher On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless luc...@mikemccandless.com wrote: Hi Itamar, One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss. 
More responses below: On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko ita...@code972.com wrote: I'm a Lucene.Net committer, and there is a chance we have a bug in our FSDirectory implementation that causes indexes to get corrupted when indexing is cut while the IW is still open. As it stems from some retroactive fixes you made, I'd appreciate your feedback. Correct me if I'm wrong, but by design Lucene should be able to recover rather quickly from power failures or app crashes. Since existing segment files are read only, only new segments that are still being written can get corrupted. Hence, recovering from worst-case scenarios is done by simply removing the write.lock file. The worst that could happen then is having the last segment damaged, and that can be fixed by removing those files, possibly by running CheckIndex on the index. You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and then removing old segments_N files (and any segments that are no longer referenced). You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now). Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash. Had no commit completed (no segments file written)? If you don't fsync then all sorts of crazy things are possible... 
I've been looking at these: https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328 broke...). And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not expected behavior for a Lucene index. Definitely not expected behavior: assuming nothing is flipping bits, then on OS/JVM crash or power loss your index should be fine, just reverted to the last successful commit. What I'm looking for at the moment is some advice on what FSDirectory implementation to use to make sure no corruption can happen. The 3.4 version (which is where LUCENE-3418 was committed) seems to handle a lot of things the 3.0 doesn't, but on the other hand the issue LUCENE-3418 fixes was introduced by changes made to the 3.0 codebase. Hopefully it's just that you are missing fsync! Also
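The commit ordering quoted above (fsync the new segment files first, only then write the new segments_N, and only then delete the old one) can be sketched with java.nio. This is an illustration of the ordering, not Lucene's actual on-disk protocol; the file names are made up:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the LUCENE-1044 ordering: durably write segment data, then publish
// the new segments_N, then remove the old generation. A crash at any earlier
// step leaves the previous commit point fully intact. File names are illustrative.
public class CommitOrdering {
    static void fsync(Path p) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
            ch.force(true);  // flush file contents and metadata to stable storage
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("idx");
        Path seg = dir.resolve("_1.dat");
        Path oldGen = dir.resolve("segments_1");
        Path newGen = dir.resolve("segments_2");
        Files.write(oldGen, "gen1".getBytes());   // the previous commit point

        // 1) write and fsync the new segment files BEFORE publishing them
        Files.write(seg, "segment data".getBytes());
        fsync(seg);

        // 2) publish the commit point: write and fsync the new segments_N
        Files.write(newGen, "gen2 -> _1".getBytes());
        fsync(newGen);

        // 3) only now is it safe to delete the previous segments_N
        Files.delete(oldGen);

        System.out.println(Files.exists(newGen) && !Files.exists(oldGen));
    }
}
```

The bug discussed in LUCENE-3418 was precisely that step 1's fsync was silently not reaching the disk, which voids the safety argument for steps 2 and 3.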
Corrupt index
Hi Java devs, I'm a Lucene.Net committer, and there is a chance we have a bug in our FSDirectory implementation that causes indexes to get corrupted when indexing is cut while the IW is still open. As it stems from some retroactive fixes you made, I'd appreciate your feedback. Correct me if I'm wrong, but by design Lucene should be able to recover rather quickly from power failures or app crashes. Since existing segment files are read only, only new segments that are still being written can get corrupted. Hence, recovering from worst-case scenarios is done by simply removing the write.lock file. The worst that could happen then is having the last segment damaged, and that can be fixed by removing those files, possibly by running CheckIndex on the index. Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash. I've been looking at these: https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not expected behavior for a Lucene index. What I'm looking for at the moment is some advice on what FSDirectory implementation to use to make sure no corruption can happen. 
The 3.4 version (which is where LUCENE-3418 was committed) seems to handle a lot of things the 3.0 doesn't, but on the other hand the issue LUCENE-3418 fixes was introduced by changes made to the 3.0 codebase. Also, is there any test in the suite checking for those scenarios? Will appreciate any help on this, Itamar.
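The recovery contract the thread keeps returning to (a crash loses at most the work since the last commit, and the index reverts to the last successful commit) can be modeled with a toy, non-Lucene sketch; ToyIndex is entirely my own illustration, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

// A toy model of Lucene's transactional contract: documents buffered after the
// last commit are lost on a crash, while every committed document survives.
public class ToyIndex {
    final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    void addDocument(String doc) { pending.add(doc); }

    void commit() {                    // durable point, like IndexWriter.commit()
        committed.addAll(pending);
        pending.clear();
    }

    void crash() { pending.clear(); }  // uncommitted work vanishes

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.addDocument("doc1");
        idx.addDocument("doc2");
        idx.commit();                  // these two are now durable
        idx.addDocument("doc3");
        idx.crash();                   // process killed before the next commit
        // The index reverts to the last successful commit.
        System.out.println(idx.committed);
    }
}
```

The corruption reported in this thread is exactly a violation of this model: after a crash the index did not revert to the last commit but became unreadable.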
Re: Corrupt index
Mike, On Wed, Jun 13, 2012 at 7:31 PM, Michael McCandless luc...@mikemccandless.com wrote: Hi Itamar, One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss. Definitely, as Christopher noted we are about to release a 3.0.3 compatible version, which is a line-by-line port of the Java version. You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and then removing old segments_N files (and any segments that are no longer referenced). You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now). Somewhat unrelated to this thread, but what should I expect to see? From time to time we do see write.lock present after an app-crash or power failure. Also, what are the steps that are expected to be performed in such cases? Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash. Had no commit completed (no segments file written)? If you don't fsync then all sorts of crazy things are possible... Ok, so we do have fsync since LUCENE-1044 is present, and there were segments present from previous commits. Any idea what went wrong? 
I've been looking at these: https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328 broke...). So 2328 broke 1044, and this was fixed only in 3.4, right? So 2328 made it to a 3.0.x release while the fix for it (3418) was only released in 3.4. Am I right? If this is the case, 2328 probably made its way to Lucene.Net since we are using the released sources for porting, and we now need to apply 3418 in the current version. Does it make sense to just port FSDirectory from 3.4 to 3.0.3? Or were there API or other changes that will make our life miserable if we do that? And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not expected behavior for a Lucene index. Definitely not expected behavior: assuming nothing is flipping bits, then on OS/JVM crash or power loss your index should be fine, just reverted to the last successful commit. What I suspected. Will try to reproduce reliably - any recommendations? Not really feeling like reinventing the wheel here... MockDirectoryWrapper wasn't ported yet as it only appears in 3.4, and as you said it won't really help here anyway What I'm looking for at the moment is some advice on what FSDirectory implementation to use to make sure no corruption can happen. The 3.4 version (which is where LUCENE-3418 was committed) seems to handle a lot of things the 3.0 doesn't, but on the other hand the issue LUCENE-3418 fixes was introduced by changes made to the 3.0 codebase. Hopefully it's just that you are missing fsync! Also, is there any test in the suite checking for those scenarios? 
Our test framework has a sneaky MockDirectoryWrapper that, after a test finishes, goes and corrupts any unsync'd files and then verifies the index is still OK... it's good because it'll catch any times we are missing calls to sync, but it's not low level enough such that if FSDir is failing to actually call fsync (that was the bug in LUCENE-3418) then it won't catch that... Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
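Mike's description of MockDirectoryWrapper (track which files were sync'd; after the test, corrupt everything that wasn't and verify the index still opens) can be modeled with a toy, non-Lucene sketch; CrashyDirectory is my own illustration of the idea, not the real class:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A toy model of the MockDirectoryWrapper testing idea: remember which files were
// sync'd, and on a simulated crash corrupt every unsync'd file, so a test can
// verify the committed state still reads back intact.
public class CrashyDirectory {
    final Map<String, String> files = new HashMap<>();
    private final Set<String> synced = new HashSet<>();

    void write(String name, String data) { files.put(name, data); synced.remove(name); }
    void sync(String name) { synced.add(name); }

    // Simulated OS crash or power loss: any file never fsync'd has undefined contents.
    void crash() {
        for (String name : files.keySet()) {
            if (!synced.contains(name)) files.put(name, "<garbage>");
        }
    }

    public static void main(String[] args) {
        CrashyDirectory dir = new CrashyDirectory();
        dir.write("_1.dat", "segment data");
        dir.sync("_1.dat");            // fsync'd before the commit point is published
        dir.write("segments_2", "commit -> _1");
        dir.sync("segments_2");
        dir.write("_2.dat", "in-flight segment");  // never sync'd
        dir.crash();
        // Committed, sync'd files survive; only the unsync'd file is corrupted.
        System.out.println(dir.files.get("segments_2"));
        System.out.println(dir.files.get("_2.dat"));
    }
}
```

As Mike notes, a wrapper at this level catches missing *calls* to sync, but not an FSDirectory whose sync implementation silently fails to reach the disk, which was the LUCENE-3418 bug.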
Re: Corrupt index
Christopher, I used the IndexBuilder app from here https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with a 8.5GB wikipedia dump. After running for 2.5 days I had to forcefully close it (infinite loop in the wiki-markdown parser at 92%, go figure), and the 40-something GB index I had by then was unusable. I then was able to reproduce this Please note I now added a few safe-guards you might want to remove to make sure the app really crashes on process kill. I'll try to come up with a better way to reproduce this - hopefully Mike will be able to suggest better ways than manual process kill... On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens currens.ch...@gmail.com wrote: Mike, The codebase for lucene.net should be almost identical to java's 3.0.3 release, and LUCENE-1044 is included in that. Itamar, are you committing the index regularly? I only ask because I can't reproduce it myself by forcibly terminating the process while it's indexing. I've tried both 3.0.3 and 2.9.4. If I don't commit at all and terminate the process (even with a 10,000 4K documents created), there will be no documents in the index when I open it in luke, which I expect. If I commit at 10,000 documents, and terminate it a few thousand after that, the index has the first ten thousand that were committed. I've even terminated it *while* a second commit was taking place, and it still had all of the documents I expected. It may be that I'm not trying to reproducing it correctly. Do you have a minimal amount of code that can reproduce it? Thanks, Christopher On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless luc...@mikemccandless.com wrote: Hi Itamar, One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss. 
More responses below: On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko ita...@code972.com wrote: I'm a Lucene.Net committer, and there is a chance we have a bug in our FSDirectory implementation that causes indexes to get corrupted when indexing is cut while the IW is still open. As it stems from some retroactive fixes you made, I'd appreciate your feedback. Correct me if I'm wrong, but by design Lucene should be able to recover rather quickly from power failures or app crashes. Since existing segment files are read-only, only new segments that are still being written can get corrupted. Hence, recovering from worst-case scenarios is done by simply removing the write.lock file. The worst that could happen then is having the last segment damaged, and that can be fixed by removing those files, possibly by running CheckIndex on the index. You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and then removing old segments_N files (and any segments that are no longer referenced). You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now). Last week I was playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash. Had no commit completed (no segments file written)? If you don't fsync then all sorts of crazy things are possible... 
I've been looking at these: https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328 broke...). And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not an expected behavior of a Lucene index. Definitely not expected behavior: assuming nothing is flipping bits, then on OS/JVM crash or power loss your index should be fine, just reverted to the last successful commit. What I'm looking for at the moment is some advice on what FSDirectory implementation to use to make sure no corruption can happen. The 3.4 version (which is where LUCENE-3418 was committed to) seems to handle a lot of things the 3.0 doesn't, but on the other hand LUCENE-3418 was introduced by changes made to the 3.0 codebase. Hopefully it's just that you are missing fsync! Also
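The durability rule Mike describes (from LUCENE-1044) is about ordering: every new segment file is fsync'd before the new segments_N commit point is written, so a crash before the commit point is durable simply leaves the previous commit intact. A stand-alone illustrative sketch of that write-then-sync-then-publish pattern in Java — not Lucene's actual code, and the file names are made up:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative sketch of the LUCENE-1044 commit rule (not Lucene source):
// fsync every new segment file first, and only then write + fsync the
// segments_N file that makes them visible. If the process dies before
// segments_N is durable, readers still see the previous commit point.
public class CommitSketch {
    static boolean writeDurably(File f, byte[] data) {
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
            out.getFD().sync(); // fsync: force the bytes to stable storage
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "commit-sketch");
        dir.mkdirs();
        // Step 1: write and fsync all new segment files first...
        boolean seg = writeDurably(new File(dir, "_0.cfs"), "segment data".getBytes());
        // Step 2: ...only then publish the commit point that references them.
        boolean commit = seg && writeDurably(new File(dir, "segments_1"), "_0.cfs".getBytes());
        System.out.println(commit ? "committed" : "failed");
    }
}
```

The point of the ordering is exactly what the thread discusses: skipping the sync in step 1 (the LUCENE-2328/LUCENE-3418 regression) can leave a segments_N that references files whose bytes never reached disk.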
Re: Corrupt index
Yes, reproduced on the first try. See attached program - I referenced it against current trunk. On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko ita...@code972.com wrote: [...]
[jira] [Commented] (LUCENENET-438) replace java doc notation with ms style xml comments notation.
[ https://issues.apache.org/jira/browse/LUCENENET-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293983#comment-13293983 ] Itamar Syn-Hershko commented on LUCENENET-438: -- This should be made by a tool, really replace java doc notation with ms style xml comments notation. Key: LUCENENET-438 URL: https://issues.apache.org/jira/browse/LUCENENET-438 Project: Lucene.Net Issue Type: Improvement Components: Lucene.Net Contrib, Lucene.Net Core Affects Versions: Lucene.Net 2.9.4g Environment: all Reporter: michael herndon Labels: documentation, There are a ton of JavaDoc-style notations inside the xml code comments, i.e. {@link #IncrementToken} These need to use the ms xml code comment style if there is an existing equivalent. I'm not assigning this one. If you come across this on code you are working on, please take an extra few minutes to fix up the comments. If you need help, grab me on #lucene.net on irc or michaelherndon on skype. Just let me know who you are and what help you need. A guide for code documentation, it includes a table that shows JavaDoc and XML doc comment equivalents: https://cwiki.apache.org/confluence/display/LUCENENET/Documenting+Lucene.Net -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
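The bulk conversion the comment asks for ("should be made by a tool, really") is largely mechanical. A minimal illustrative sketch in Java — a hypothetical helper, not an official Lucene.Net tool — that rewrites the most common `{@link #Member}` form into the C# XML doc `<see cref="..."/>` equivalent (the guide linked in the issue has the full equivalence table):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-rule converter: turns JavaDoc "{@link #Member}" or
// "{@link Member}" into the C# XML doc comment form <see cref="Member"/>.
// A real tool would cover the whole JavaDoc-to-XML equivalence table.
public class DocConvert {
    static String convert(String comment) {
        Pattern p = Pattern.compile("\\{@link\\s+#?(\\w+)\\}");
        Matcher m = p.matcher(comment);
        return m.replaceAll("<see cref=\"$1\"/>");
    }

    public static void main(String[] args) {
        System.out.println(convert("See {@link #IncrementToken} for details."));
    }
}
```

Qualified links such as `{@link Package.Class#member(args)}` would need additional rules mapping them onto cref syntax, which is why a table-driven tool beats doing this by hand.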
Sponsoring porting work
Hi devs, We are looking to sponsor porting work, to help keep up the pace of development and help Lucene.Net stay closer to Java Lucene. Unfortunately the amount of work I can put into this is very limited, and being up to speed with Lucene is important to us, hence the idea to offer sponsorship. I'm not entirely sure how these things work under the Apache umbrella, but I'd imagine there isn't a real issue doing that. All work will be handed back to the project under the ASL, of course. I'd appreciate any guidance if needed. In the meantime, interested parties are welcome to contact me privately. Itamar.
Re: EOLs in Code
Yes, seems like all of them. Will look into it. On Sat, Jun 2, 2012 at 9:25 PM, Stefan Bodewig bode...@apache.org wrote: On 2012-06-02, Itamar Syn-Hershko wrote: I'm using git-svn, with auto-crlf set to true I'm not familiar enough with git-svn. auto-crlf will cover the git side but I don't think it sets the eol-style property in svn. - I think this will cover it but let me know if my commits are bad... I think some of the files I touched with https://svn.apache.org/viewvc?view=revisionrevision=1344562 are from your commit. Thanks Stefan
New Spatial module checked in
I was finally able to get git and svn to talk to one another, and pushed my recent changes into trunk. The new Spatial contrib bears the non-standard version of 2.9.9, on purpose. It also contains Spatial4n in binary form, mimicking the way it works in Java Lucene. The few tests that are present pass, but when run in a chain I get the following failure - haven't had time to track it down: Test 'Lucene.Net.Contrib.Spatial.Test.Prefix.TestRecursivePrefixTreeStrategy.BaseRecursivePrefixTreeStrategyTestCase.testFilterWithVariableScanLevel' failed: Lucene.Net.Store.AlreadyClosedException : this IndexReader is closed Index\IndexReader.cs(204,0): at Lucene.Net.Index.IndexReader.EnsureOpen() Index\DirectoryReader.cs(497,0): at Lucene.Net.Index.DirectoryReader.DoReopen(Boolean openReadOnly, IndexCommit commit) Index\DirectoryReader.cs(462,0): at Lucene.Net.Index.DirectoryReader.Reopen() SpatialTestCase.cs(111,0): at Lucene.Net.Contrib.Spatial.Test.SpatialTestCase.commit() SpatialTestCase.cs(94,0): at Lucene.Net.Contrib.Spatial.Test.SpatialTestCase.addDocumentsAndCommit(List`1 documents) StrategyTestCase.cs(67,0): at Lucene.Net.Contrib.Spatial.Test.StrategyTestCase`1.getAddAndVerifyIndexedDocuments(String testDataFile) Prefix\BaseRecursivePrefixTreeStrategyTestCase.cs(53,0): at Lucene.Net.Contrib.Spatial.Test.Prefix.BaseRecursivePrefixTreeStrategyTestCase.testFilterWithVariableScanLevel() Ideas welcome.
Re: Welcome Simon Svensson as a new committer
Welcome! On Thu, May 24, 2012 at 9:40 PM, Digy digyd...@gmail.com wrote: Welcome Simon DIGY -Original Message- From: Prescott Nasser [mailto:geobmx...@hotmail.com] Sent: Thursday, May 24, 2012 10:06 AM To: lucene-net-dev@lucene.apache.org; lucene-net-u...@lucene.apache.org Subject: Welcome Simon Svensson as a new committer Hey All, Our roster is growing a bit, I'd like to welcome Simon as a new committer. Simon has been quite active on the user mailing list helping answer community questions; he also maintains a C# port of the lucene-hunspell project (java: http://code.google.com/p/lucene-hunspell/, Simon's C# port: https://github.com/sisve/Lucene.Net.Analysis.Hunspell), which is commonly used for spell checking (but has a wide array of purposes). Please join me in welcoming Simon to the team, ~Prescott
Re: Welcome Itamar Syn-Hershko as a new committer
Thanks guys On Wed, May 23, 2012 at 1:14 AM, zoolette gaufre...@gmail.com wrote: Welcome in Itamar ! 2012/5/22 Prescott Nasser geobmx...@hotmail.com Hey all, I'd like to officially welcome Itamar as a new committer. I know the community appreciates the work you've been doing with the Spatial contrib project and the past help you've provided on the mailing lists. Please join me in welcoming Itamar, ~Prescott
[jira] [Commented] (LUCENENET-483) Spatial Search skipping records when one location is close to origin, another one is away and radius is wider
[ https://issues.apache.org/jira/browse/LUCENENET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13280126#comment-13280126 ] Itamar Syn-Hershko commented on LUCENENET-483: -- That's not the newest spatial module. Get it from here https://github.com/synhershko/lucene.net/tree/spatial2trunk or the 2.9.4-compatible version here https://github.com/synhershko/lucene.net/tree/spatial

Spatial Search skipping records when one location is close to origin, another one is away and radius is wider - Key: LUCENENET-483 URL: https://issues.apache.org/jira/browse/LUCENENET-483 Project: Lucene.Net Issue Type: Bug Components: Lucene.Net Contrib Affects Versions: Lucene.Net 2.9.4, Lucene.Net 2.9.4g Environment: .Net framework 4.0 Reporter: Aleksandar Panov Labels: lucene, spatialsearch Fix For: Lucene.Net 3.0.3

Running a spatial query against two locations, where one is less than a mile from the origin, the other is 24 miles away, and the radius is wider (52 miles), returns only one result. Running the query with a slightly wider radius (53.8 miles) returns 2 results. IMPORTANT UPDATE: the problem can't be reproduced in Java using the original Lucene spatial (2.9.4 version) library.

{code}
// Origin
private double _lat = 42.350153;
private double _lng = -71.061667;
private const string LatField = "lat";
private const string LngField = "lng";

// Locations
AddPoint(writer, "Location 1", 42.0, -71.0);   // 24 miles away from origin
AddPoint(writer, "Location 2", 42.35, -71.06); // less than a mile

[TestMethod]
public void TestAntiM()
{
    _directory = new RAMDirectory();
    var writer = new IndexWriter(_directory, new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    SetUpPlotter(2, 15);
    AddData(writer);
    _searcher = new IndexSearcher(_directory, true);
    //const double miles = 53.8; // Correct. Returns 2 Locations.
    const double miles = 52;     // Incorrect. Returns 1 Location.

    Console.WriteLine("testAntiM");

    // create a distance query
    var dq = new DistanceQueryBuilder(_lat, _lng, miles, LatField, LngField, CartesianTierPlotter.DefaltFieldPrefix, true);
    Console.WriteLine(dq);

    // create a term query to search against all documents
    Query tq = new TermQuery(new Term("metafile", "doc"));
    var dsort = new DistanceFieldComparatorSource(dq.DistanceFilter);
    Sort sort = new Sort(new SortField("foo", dsort, false));

    // Perform the search, using the term query, the distance filter, and the distance sort
    TopDocs hits = _searcher.Search(tq, dq.Filter, 1000, sort);
    int results = hits.TotalHits;
    ScoreDoc[] scoreDocs = hits.ScoreDocs;

    // Get a list of distances
    Dictionary<int, double> distances = dq.DistanceFilter.Distances;
    Console.WriteLine("Distance Filter filtered: " + distances.Count);
    Console.WriteLine("Results: " + results);
    Console.WriteLine("=");
    Console.WriteLine("Distances should be 2 " + distances.Count);
    Console.WriteLine("Results should be 2 " + results);
    Assert.AreEqual(2, distances.Count); // fixed a store of only needed distances
    Assert.AreEqual(2, results);
}
{code}
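As a sanity check on the numbers in this report (illustrative Java, not part of the issue's test case): a quick haversine computation puts Location 1 roughly 24 miles from the origin and Location 2 well under a mile, matching the comments in the reproduction code, so a 52-mile radius should indeed return both points.

```java
public class DistanceCheck {
    // Great-circle distance in miles between two lat/lng points (haversine formula).
    static double miles(double lat1, double lng1, double lat2, double lng2) {
        double r = 3958.8; // mean Earth radius in miles
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Origin and the two locations from the LUCENENET-483 report
        double d1 = miles(42.350153, -71.061667, 42.0, -71.0);   // reported as 24 miles away
        double d2 = miles(42.350153, -71.061667, 42.35, -71.06); // reported as less than a mile
        System.out.printf("d1=%.1f d2=%.2f%n", d1, d2);
    }
}
```

This makes the bug concrete: both true distances are inside the 52-mile radius, so the filter dropping one of them points at the cartesian-tier grid boxing rather than the distance math itself.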
[jira] [Commented] (LUCENENET-483) Spatial Search skipping records when one location is close to origin, another one is away and radius is wider
[ https://issues.apache.org/jira/browse/LUCENENET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13280266#comment-13280266 ] Itamar Syn-Hershko commented on LUCENENET-483: -- This code isn't mine :) Try using the code from the spatial branch instead, this is what I'm using. The DLLs I linked to above are compiled that way.
[jira] [Commented] (LUCENENET-483) Spatial Search skipping records when one location is close to origin, another one is away and radius is wider
[ https://issues.apache.org/jira/browse/LUCENENET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13280552#comment-13280552 ] Itamar Syn-Hershko commented on LUCENENET-483: -- Great. I updated the Lucene.Net.Contrib.Spatial.dll to be of version 2.9.9 to avoid future confusion. This is a unique version number indicating a non-standard issued contrib - Java Lucene will only have this module in version 4.0.
[jira] [Commented] (LUCENENET-462) Spatial Search skipping records with small radius e.g. 1 mile
[ https://issues.apache.org/jira/browse/LUCENENET-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13279642#comment-13279642 ] Itamar Syn-Hershko commented on LUCENENET-462: -- This is now fixed with the new spatial module https://issues.apache.org/jira/browse/LUCENENET-489

Spatial Search skipping records with small radius e.g. 1 mile - Key: LUCENENET-462 URL: https://issues.apache.org/jira/browse/LUCENENET-462 Project: Lucene.Net Issue Type: Bug Components: Lucene.Net Contrib Affects Versions: Lucene.Net 2.9.4 Environment: .Net framework 4.0 Reporter: Mark Rodseth Labels: lucene, spatialsearch

Running a spatial query against a list of locations all within 1 mile of a location returns correct results for 2 miles, but incorrect results for 1 mile. For the one mile query, only 2 of the 8 rows are returned. Locations Test below:

{code}
// Origin
private double _lat = 51.508129;
private double _lng = -0.128005;
private const string LatField = "lat";
private const string LngField = "lng";

// Locations
AddPoint(writer, "Location 1", 51.5073802128877, -0.124669075012207);
AddPoint(writer, "Location 2", 51.5091, -0.1235);
AddPoint(writer, "Location 3", 51.5093, -0.1232);
AddPoint(writer, "Location 4", 51.5112531582845, -0.12509822845459);
AddPoint(writer, "Location 5", 51.5107, -0.123);
AddPoint(writer, "Location 6", 51.512, -0.1246);
AddPoint(writer, "Location 8", 51.5088760101322, -0.143165588378906);
AddPoint(writer, "Location 9", 51.5087958793819, -0.143508911132813);
{code}

{code}
[Test]
public void TestAntiM()
{
    _searcher = new IndexSearcher(_directory, true);
    const double miles = 1.0; // Bug? Only returns 2 locations. Should return 8.
    // const double miles = 2.0; // Correct. Returns 8 Locations.

    Console.WriteLine("testAntiM");

    // create a distance query
    var dq = new DistanceQueryBuilder(_lat, _lng, miles, LatField, LngField, CartesianTierPlotter.DefaltFieldPrefix, true);
    Console.WriteLine(dq);

    // create a term query to search against all documents
    Query tq = new TermQuery(new Term("metafile", "doc"));
    var dsort = new DistanceFieldComparatorSource(dq.DistanceFilter);
    Sort sort = new Sort(new SortField("foo", dsort, false));

    // Perform the search, using the term query, the distance filter, and the distance sort
    TopDocs hits = _searcher.Search(tq, dq.Filter, 1000, sort);
    int results = hits.totalHits;
    ScoreDoc[] scoreDocs = hits.scoreDocs;

    // Get a list of distances
    Dictionary<int, double> distances = dq.DistanceFilter.Distances;
    Console.WriteLine("Distance Filter filtered: " + distances.Count);
    Console.WriteLine("Results: " + results);
    Console.WriteLine("=");
    Console.WriteLine("Distances should be 8 " + distances.Count);
    Console.WriteLine("Results should be 8 " + results);
    Assert.AreEqual(8, distances.Count); // fixed a store of only needed distances
    Assert.AreEqual(8, results);
}
{code}
[jira] [Commented] (LUCENENET-483) Spatial Search skipping records when one location is close to origin, another one is away and radius is wider
[ https://issues.apache.org/jira/browse/LUCENENET-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13279643#comment-13279643 ] Itamar Syn-Hershko commented on LUCENENET-483: -- This should be fixed with the new spatial module, can you check? https://issues.apache.org/jira/browse/LUCENENET-489
[jira] [Commented] (SOLR-3304) Add Solr support for the new Lucene spatial module
[ https://issues.apache.org/jira/browse/SOLR-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13279609#comment-13279609 ] Itamar Syn-Hershko commented on SOLR-3304: -- In continuation to the discussion on the spatial4j list, +1 for having all the tests with actual spatial logic reside in the Lucene spatial module, and have the Solr tests rely on that Add Solr support for the new Lucene spatial module -- Key: SOLR-3304 URL: https://issues.apache.org/jira/browse/SOLR-3304 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Bill Bell Assignee: David Smiley Labels: spatial Attachments: SOLR-3304_Solr_fields_for_Lucene_spatial_module.patch Get the Solr spatial module integrated with the lucene spatial module.
Re: Spatial4n
What are you trying to do - to work on it or to incorporate my changes? I'm not done yet - everything was ported but there's some nasty failing test I'm hunting down atm. You should be able to commit all my changes back to SVN with git-svn, but you can also get the latest sources from here as a zipball https://github.com/synhershko/lucene.net/zipball/spatial There are some very good git tutorials - worth reading. Check github's, for example. Basically you just do git clone git://github.com/synhershko/lucene.net.git and git checkout spatial and you are done. You'll never want to go back to SVN :) On Thu, May 17, 2012 at 8:52 PM, Prescott Nasser geobmx...@hotmail.com wrote: Itamar - I'm terrible with git, the last two weekends I tried cutting out and making a patch of the work you've done with spatial with no luck (I do like learning new things so I wanted to give it a shot before reaching out). Do you know how to do that? Or some way to see all the changed files / lines in git? Sorry, I'm slow this month ;) ~P From: geobmx...@hotmail.com To: lucene-net-...@lucene.apache.org Subject: RE: Spatial4n Date: Thu, 3 May 2012 20:13:50 -0700 I'll try to give you a hand this weekend great work ~P Date: Fri, 4 May 2012 05:48:51 +0300 Subject: Re: Spatial4n From: ita...@code972.com To: lucene-net-...@lucene.apache.org Status update: The Spatial4j project is completely ported to .NET, including tests, all of which are green. It is available from https://github.com/synhershko/Spatial4n The Lucene spatial module which takes a dependency on spatial4j is also ported now: https://github.com/synhershko/lucene.net/tree/spatial . I had to hack around quite a lot there, and created many compatibility classes and methods, since that module was originally written for the Lucene 4 API. There is only one issue in FixedBitSet preventing it from compiling, I'll take a look at it sometime soon (or if any of you can have a look, that'd be great...) I'm now working on porting the spatial test suite. 
As before, any help will be appreciated. Itamar. On Thu, Apr 26, 2012 at 6:45 PM, Itamar Syn-Hershko ita...@code972.com wrote: Hi again, I completed the port of the external Spatial library, and now am moving to porting the Lucene integration. The library, Spatial4n, is under ASL2 and can be found here https://github.com/synhershko/Spatial4n Anyone who can chip in and help port the tests, that would greatly help. There are not so many :) Itamar.
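The git workflow discussed in the thread above - get the branch, see every changed file/line, and turn the branch's work into patch files - can be sketched as follows. This is a self-contained illustration: a throwaway local repo stands in for the real clone (the real commands would be `git clone git://github.com/synhershko/lucene.net.git` followed by `git checkout spatial`), and the file name, branch contents, and commit messages are made up for the example.

```shell
# Self-contained sketch: a local throwaway repo stands in for the real clone.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git config user.email demo@example.com && git config user.name Demo
echo "original" > Spatial.cs
git add . && git commit -qm "initial import"
base=$(git rev-parse --abbrev-ref HEAD)   # master or main, depending on git version
git checkout -qb spatial                  # the feature branch, like 'git checkout spatial'
echo "ported from spatial4j" >> Spatial.cs
git commit -qam "port spatial module"
# See all changed files/lines on the branch relative to the base branch:
git diff "$base"..spatial --stat
# Export the branch's work as patch files that can be applied elsewhere:
git format-patch -q "$base" -o patches
ls patches
```

For a real clone the same two inspection commands apply unchanged: `git diff master..spatial --stat` answers "which files/lines changed", and `git format-patch master` produces one mail-ready patch per commit.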
Re: including external code under apache 2.0
ICLA signed and sent On Mon, Apr 30, 2012 at 11:27 AM, Stefan Bodewig bode...@apache.org wrote: On 2012-04-28, Itamar Syn-Hershko wrote: That mail from Stephan got lost in my inbox, so I never followed up on that. I guess now would be a good chance to tie up all loose ends. How do I do the ICLA? In addition to what Troy said, you can also fill in the text form and PGP-sign it when you send it by email. See http://www.apache.org/licenses/#clas Stefan
Re: Spatial4n
No, but let me know what I need to do. On Sat, Apr 28, 2012 at 1:20 AM, Prescott Nasser geobmx...@hotmail.com wrote: Itamar, have you filed an ICLA? If so we are good to go on this, and I'll put this in place of the current spatial code in contrib From: geobmx...@hotmail.com To: lucene-net-dev@lucene.apache.org Subject: RE: Spatial4n Date: Thu, 26 Apr 2012 16:46:05 -0700 Hey Stefan - can you confirm that porting Spatial4n is ok to include in our contrib? It is also under the Apache 2.0 license, but we wanted to be 100% sure. ~P Date: Thu, 26 Apr 2012 18:45:48 +0300 Subject: Spatial4n From: ita...@code972.com To: lucene-net-dev@lucene.apache.org Hi again, I completed the port of the external Spatial library, and now am moving to porting the Lucene integration. The library, Spatial4n, is under ASL2 and can be found here https://github.com/synhershko/Spatial4n Anyone who can chip in and help port the tests, that would greatly help. There are not so many :) Itamar.
[jira] [Commented] (LUCENENET-484) Some possibly major tests intermittently fail
[ https://issues.apache.org/jira/browse/LUCENENET-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264399#comment-13264399 ] Itamar Syn-Hershko commented on LUCENENET-484: -- That's probably a matter of things not being cleaned up properly in some tests? (didn't actually look at the tests, just the immediate thing that comes to mind) Some possibly major tests intermittently fail -- Key: LUCENENET-484 URL: https://issues.apache.org/jira/browse/LUCENENET-484 Project: Lucene.Net Issue Type: Bug Components: Lucene.Net Core, Lucene.Net Test Affects Versions: Lucene.Net 3.0.3 Reporter: Christopher Currens Fix For: Lucene.Net 3.0.3 These tests will fail intermittently in Debug or Release mode, in the core test suite: # -Lucene.Net.Index:- #- -TestConcurrentMergeScheduler.TestFlushExceptions- # Lucene.Net.Store: #- TestLockFactory.TestStressLocks # Lucene.Net.Search: #- TestSort.TestParallelMultiSort # Lucene.Net.Util: #- TestFieldCacheSanityChecker.TestInsanity1 #- TestFieldCacheSanityChecker.TestInsanity2 #- (It's possible all of the insanity tests fail at one point or another) # Lucene.Net.Support #- TestWeakHashTableMultiThreadAccess.Test TestWeakHashTableMultiThreadAccess should be fine to remove along with the WeakHashTable in the Support namespace, since it's been replaced with WeakDictionary. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
Spatial4n
Hi again, I completed the port of the external Spatial library, and now am moving to porting the Lucene integration. The library, Spatial4n, is under ASL2 and can be found here https://github.com/synhershko/Spatial4n Anyone who can chip in and help port the tests, that would greatly help. There are not so many :) Itamar.
Re: Spatial contrib bug fixing
Great. The actual library lives outside of Lucene ( https://github.com/spatial4j/spatial4j ) and only some integration classes are within the Lucene project itself. I linked to the (long) discussions about this in my previous message. I will be following that approach with this port, and really hope there will be no API differences I won't be able to overcome. I'm going to start doing this sometime tomorrow, but my main efforts will be on Thursday. I can certainly use any help in dividing work etc - please, anyone who can join on Thursday for live collaboration or later chip in on the discussion. I'll keep you posted. On Wed, Apr 25, 2012 at 12:00 AM, Christopher Currens currens.ch...@gmail.com wrote: Yes, the contrib is a MESS. I've been favoring complete re-implementations over porting changes, since contrib has been in such a poor, overlooked state for so long. I'm not opposed to porting LSP over the Spatial contrib project in Java, though it might pose some porting challenges both now, since Lucene versions are different, and as Lucene.NET evolves. It also might not; I'm not familiar with the LSP code. Contrib is just that, contributed software that is not part of the core library, and there will be projects in Java we can't port over. In fact, I think there are .NET specific contrib projects that aren't in Java. Either way, my point is that I am happy and willing to have LSP included if that's going to wind up being better than Spatial. I think we can use all the help and contributions we can get in Lucene.NET. Of course, we'd need to look and see what is possible with porting over LSP (not sure if it relies on any version-specific features that may not yet be in 3.0.3). So, I say let's go for it, and if you need any help/want to divide work between other committers, we can arrange that, and create issues for it, that is, if the other committers don't object to this.
On Tue, Apr 24, 2012 at 1:45 PM, Itamar Syn-Hershko ita...@code972.com wrote: Thanks for your reply. Aside from the original port which had many divergences from Java, the only other issue applied to spatial is LUCENENET-431, which would be easy to include. That is not correct. LUCENENET-431 was committed, but some fixes from Java Lucene 3.0.3 are in as well. The whole thing is a mess. The reason for this mess is the amount of bugs in the original Java implementation of Spatial. This is also why it has been deprecated in 3.6: https://issues.apache.org/jira/browse/LUCENE-2599 I think the best route at this point is to port LSP aka Spatial4j to .NET and start using it as the Spatial module for Lucene.NET https://issues.apache.org/jira/browse/LUCENE-3795 This is a Java Lucene 4 feature, but the current spatial implementation is pretty unusable. I'm going to start looking into this, and would definitely appreciate your input. Itamar.
Re: guestimation on -pre nuget package.
If it is known to be stable for actual use, we, the RavenDB dev team, will update to it in a branch and provide feedback. A -Pre nuget package released for every RC can definitely help here. On Tue, Apr 24, 2012 at 5:25 PM, Michael Herndon mhern...@wickedsoftware.net wrote: Do you all think we're at a point to do a -pre nuget package that users can tinker with and provide feedback on? The -pre flag means that it is only meant to be a pre-release in order to get feedback. We might get more feedback if we package the binaries. Those that pushed the last package, what do you think is the amount of effort / time it will take to get something like this done? (I'm asking so that I can block off enough time in my schedule to do this.) I'm guessing it shouldn't be as rigorous as a typical Apache release as it's meant just to package alpha/beta binaries, not an official RTW. - michael.
Re: Spatial contrib bug fixing
Uhm.. I was referring to the .NET port, which I can see DIGY ported. Nevermind, I will get it from the original commit. @Prescott any idea re "CartesianPolyFilterBuilder.GetBoxShape() is not an exact port" - do you remember why? On Tue, Apr 24, 2012 at 12:26 AM, Christopher Currens currens.ch...@gmail.com wrote: It's in a weird place. And for the 3.0.3 version, it's easiest to find the code in the tags, rather than branches. http://svn.apache.org/viewvc/lucene/java/tags/lucene_3_0_3/contrib/misc/src/java/org/apache/lucene/misc/ On Mon, Apr 23, 2012 at 2:20 PM, Prescott Nasser geobmx...@hotmail.com wrote: I'm having trouble finding chained filter in the java lucene svn... http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/?pathrev=990167 Am I looking around in the wrong place? Date: Mon, 23 Apr 2012 11:19:51 +0300 Subject: Re: Spatial contrib bug fixing From: ita...@code972.com To: lucene-net-...@lucene.apache.org One more thing - what's the deal with ChainedFilter? I can see a commit by DIGY on 7/7/2011 but it seems to have been removed since? On Mon, Apr 23, 2012 at 11:06 AM, Itamar Syn-Hershko ita...@code972.com wrote: For starters - CartesianPolyFilterBuilder.GetBoxShape() is not an exact port - do you remember why? Anyway, if it was never fully ported as you say, maybe I'll just go ahead and complete that. For your reference, here are 2 failing tests which pass in Java Lucene (I can send the java file) - https://github.com/synhershko/lucene.net/commit/234da7eca7cb08be5a0c2a7375ffc3f4a03bfd92 On Mon, Apr 23, 2012 at 1:39 AM, Prescott Nasser geobmx...@hotmail.com wrote: I think that was a while ago, and I don't even recall if I fully ported it or just put up the start. I had some other stuff to deal with the last few months, so my memory is a bit lacking. I'll review the code; meanwhile ask whatever questions you have - let's get this fixed up.
~P Date: Sun, 22 Apr 2012 22:10:27 +0300 Subject: Spatial contrib bug fixing From: ita...@code972.com To: lucene-net-...@lucene.apache.org Hi all, We encountered several bugs with the Spatial contrib, and the ones we tested with Java Lucene worked there (with 2.9.4). There are about 3 open tickets in the Jira bug tracker on similar issues. I'm now sitting with the ultimate goal of fixing this once and for all, but some code parts are commented out in favor of implementations that are not line-by-line ports, without a comment giving the reasons. I was wondering if there's anyone who could answer a few questions there, instead of me changing things back and forth? Git history (I use the Git mirror, yes) tells me Prescott Nasser is behind porting this - maybe he will have the answers? Cheers, Itamar.
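The message above relies on git history to find who is behind a port. A minimal sketch of that lookup follows; a tiny local repo stands in for the real Git mirror, and the file name, author, and commit message are illustrative stand-ins for the actual history.

```shell
# Sketch: use git history to answer "who ported this file?".
# A throwaway local repo stands in for the real mirror; names are illustrative.
set -e
work=$(mktemp -d)
cd "$work"
git init -q
git config user.name "Prescott Nasser" && git config user.email p@example.com
echo "port" > CartesianPolyFilterBuilder.cs
git add . && git commit -qm "port spatial contrib"
# List author and subject of every commit touching the file:
git log --format='%an: %s' -- CartesianPolyFilterBuilder.cs
# Line-by-line attribution for the same file:
git blame --line-porcelain CartesianPolyFilterBuilder.cs | grep '^author '
```

Against the real mirror, the same `git log --format='%an: %s' -- <path>` and `git blame <path>` commands identify the porter and the commits that diverged from a line-by-line port.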
Re: Spatial contrib bug fixing
Thanks for your reply. Aside from the original port which had many divergences from Java, the only other issue applied to spatial is LUCENENET-431, which would be easy to include. That is not correct. LUCENENET-431 was committed, but some fixes from Java Lucene 3.0.3 are in as well. The whole thing is a mess. The reason for this mess is the amount of bugs in the original Java implementation of Spatial. This is also why it has been deprecated in 3.6: https://issues.apache.org/jira/browse/LUCENE-2599 I think the best route at this point is to port LSP aka Spatial4j to .NET and start using it as the Spatial module for Lucene.NET https://issues.apache.org/jira/browse/LUCENE-3795 This is a Java Lucene 4 feature, but the current spatial implementation is pretty unusable. I'm going to start looking into this, and would definitely appreciate your input. Itamar.
Re: Spatial contrib bug fixing
For starters - CartesianPolyFilterBuilder.GetBoxShape() is not an exact port - do you remember why? Anyway, if it was never fully ported as you say, maybe I'll just go ahead and complete that. For your reference, here are 2 failing tests which pass in Java Lucene (I can send the java file) - https://github.com/synhershko/lucene.net/commit/234da7eca7cb08be5a0c2a7375ffc3f4a03bfd92 On Mon, Apr 23, 2012 at 1:39 AM, Prescott Nasser geobmx...@hotmail.com wrote: I think that was a while ago, and I don't even recall if I fully ported it or just put up the start. I had some other stuff to deal with the last few months, so my memory is a bit lacking. I'll review the code; meanwhile ask whatever questions you have - let's get this fixed up. ~P Date: Sun, 22 Apr 2012 22:10:27 +0300 Subject: Spatial contrib bug fixing From: ita...@code972.com To: lucene-net-dev@lucene.apache.org Hi all, We encountered several bugs with the Spatial contrib, and the ones we tested with Java Lucene worked there (with 2.9.4). There are about 3 open tickets in the Jira bug tracker on similar issues. I'm now sitting with the ultimate goal of fixing this once and for all, but some code parts are commented out in favor of implementations that are not line-by-line ports, without a comment giving the reasons. I was wondering if there's anyone who could answer a few questions there, instead of me changing things back and forth? Git history (I use the Git mirror, yes) tells me Prescott Nasser is behind porting this - maybe he will have the answers? Cheers, Itamar.