[jira] [Created] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Key: LUCENE-3358
URL: https://issues.apache.org/jira/browse/LUCENE-3358
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.3
Reporter: Trejkaz

Lucene 3.3 (possibly 3.1 onwards) exhibits less-than-great behaviour when tokenising hiragana if combining marks are in use. Here's a unit test:

{code}
@Test
public void testHiraganaWithCombiningMarkDakuten() throws Exception {
    // Hiragana 'sa' followed by the combining mark dakuten
    TokenStream stream = new StandardTokenizer(Version.LUCENE_33,
                                               new StringReader("\u3055\u3099"));

    // Should be kept together.
    List<String> expectedTokens = Arrays.asList("\u3055\u3099");
    List<String> actualTokens = new LinkedList<String>();

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
        actualTokens.add(term.toString());
    }

    assertEquals("Wrong tokens", expectedTokens, actualTokens);
}
{code}

This code fails with:

{noformat}
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
{noformat}

It seems the tokeniser is throwing away the combining mark entirely. 3.0's behaviour was also undesirable:

{noformat}
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
{noformat}

But at least the token was there, so it was possible to write a filter to work around the issue. Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run appear to be lumped into a single token (a problem in its own right, though I'm not sure whether it's really a bug).

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
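Until the tokenizer itself handles combining marks, one possible workaround (a sketch only, not part of any committed fix) is to compose input to Unicode NFC before tokenization, so that さ (U+3055) plus the combining dakuten (U+3099) becomes the single precomposed character ざ (U+3056), leaving no bare combining mark for the tokenizer to drop:

```java
import java.text.Normalizer;

public class Main {
    public static void main(String[] args) {
        // Hiragana 'sa' (U+3055) followed by the combining dakuten (U+3099)
        String decomposed = "\u3055\u3099";

        // NFC composes the pair into the single code point U+3056 ('za'),
        // so a downstream tokenizer never sees a bare combining mark.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(composed.length());                        // 1
        System.out.println(Integer.toHexString(composed.charAt(0)));  // 3056
    }
}
```

In a Lucene analysis chain this normalization would more naturally live in a CharFilter or TokenFilter ahead of StandardTokenizer; the sketch above only demonstrates the Unicode composition step with the JDK's own java.text.Normalizer.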
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 129 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/129/

2 tests failed.

REGRESSION: org.apache.solr.client.solrj.embedded.MultiCoreExampleJettyTest.testMultiCore

Error Message:
Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. - java.lang.RuntimeException: java.io.FileNotFoundException: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk-java7/checkout/solr/example/multicore/core0/data/index/org.apache.solr.core.RefCntRamDirectory@7e96f890 lockFactory=org.apache.lucene.store.simplefslockfact...@4b905345-write.lock (No such file or directory)

Stack Trace:
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:392)
at org.apache.solr.core.SolrCore.(SolrCore.java:562)
at org.apache.solr.core.SolrCore.(SolrCore.java:509)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
at org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(Jet

Error Message:
request: http://localhost:27238/example/core0/update?commit=true&waitSearcher=true&wt=javabin&version=2

Stack Trace:
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:434)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:104)
at org.apache.solr.client.solrj.MultiCoreExampleTestBase.testMultiCore(MultiCoreExampleTestBas
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078578#comment-13078578 ]

David Smiley commented on SOLR-2690:
------

Hoss, thanks for elaborating on the distinction between the date literal and the DateMath timezone. I was conflating these issues in my mind -- silly me.

> Date Faceting or Range Faceting doesn't take timezone into account.
> ---
>
> Key: SOLR-2690
> URL: https://issues.apache.org/jira/browse/SOLR-2690
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 3.3
> Reporter: David Schlotfeldt
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> Timezone needs to be taken into account when doing date math. Currently it isn't. DateMathParser instances created are always being constructed with UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight-savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year.
> I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved.
> Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of the date facets.
> Another solution is to expand the syntax of the expressions DateMathParser understands. For example we could allow "(?timeZone=VALUE)" to be added anywhere within an expression. VALUE would be the id of the timezone. When DateMathParser reads this, it sets the timezone on the Calendar it is using.
> Two examples:
> - "(?timeZone=America/Chicago)NOW/YEAR"
> - "(?timeZone=America/Chicago)+1MONTH"
> I would be more than happy to modify DateMathParser and provide a patch. I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used.
> Thanks!
> David
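The reporter's point about +1MONTH can be made concrete with a standalone java.util sketch (this is plain JDK Calendar arithmetic for illustration, not DateMathParser itself): the span covered by a one-month gap starting 2011-03-01 is an hour shorter in America/Chicago than in UTC, because US daylight saving began on March 13, 2011.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class Main {
    // Length, in hours, of a "+1MONTH" gap starting at 2011-03-01T00:00
    // local time in the given zone.
    public static long monthGapHours(String tzId) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone(tzId));
        cal.clear();
        cal.set(2011, Calendar.MARCH, 1, 0, 0, 0);
        long start = cal.getTimeInMillis();
        cal.add(Calendar.MONTH, 1);       // -> 2011-04-01T00:00 local time
        long end = cal.getTimeInMillis();
        return (end - start) / 3600000L;  // millis per hour
    }

    public static void main(String[] args) {
        System.out.println(monthGapHours("UTC"));             // 744
        System.out.println(monthGapHours("America/Chicago")); // 743 (DST began Mar 13, 2011)
    }
}
```

So a facet gap computed with a UTC-configured parser and one computed in the user's zone genuinely cover different amounts of wall-clock time, which is exactly why the rounding timezone matters.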
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2691:
------
Attachment: SOLR-2691.patch

patch of persistence tests at the CoreContainer level (since that's where the bug was) that incorporates Yury's fix.

the assertions could definitely be beefed up to sanity-check more aspects of the serialization, and we should really also be testing that "load" works and parses all of these things back in, in the expected way, but it's a start.

The thing that's currently hanging me up is that somehow the test is leaking a SolrIndexSearcher reference. I thought maybe it was because of the SolrCores i was creating+registering and then ignoring, but if i try to close them i get an error about too many decrefs instead.

I'll let miller figure it out
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-2691:
------
Fix Version/s: 4.0
Assignee: Mark Miller
[jira] [Commented] (LUCENE-2979) Simplify configuration API of contrib Query Parser
[ https://issues.apache.org/jira/browse/LUCENE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078549#comment-13078549 ]

Adriano Crestani commented on LUCENE-2979:
------

Hi Phillipe,

Thanks for the patch. I just applied your patch for 3x. It looks good. Since you removed TestAttributes, can you create another JUnit test that verifies the configuration is updated when an attribute (like CharTermAttribute) is updated? That is basically the new functionality replacing the newly deprecated query parser attributes.

> Simplify configuration API of contrib Query Parser
> ---
>
> Key: LUCENE-2979
> URL: https://issues.apache.org/jira/browse/LUCENE-2979
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/other
> Affects Versions: 2.9, 3.0
> Reporter: Adriano Crestani
> Assignee: Adriano Crestani
> Labels: api-change, gsoc, gsoc2011, lucene-gsoc-11, mentor
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-2979_phillipe_ramalho_2.patch, LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_4_3x.patch, LUCENE-2979_phillipe_ramalho_4_trunk.patch, LUCENE-2979_phillipe_reamalho.patch
>
> The current configuration API is very complicated and inherits the concept used by the Attribute API to store token information in token streams. However, the requirements for the two (QP config and token streams) are not the same, so they shouldn't be using the same mechanism.
> I propose to simplify the QP config and make it less scary for people intending to use the contrib QP. The task is not difficult; it will just require a lot of code change and figuring out the best way to do it. That's why it's a good candidate for a GSoC project.
> I would like to hear good proposals about how to make the API more friendly and less scary :)
CHANGES.txt for modules
I can see that the descriptions of changes made to the modules are still in contrib/CHANGES.txt. Are they going to be moved to a modules/CHANGES.txt in the future?
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078538#comment-13078538 ]

David Schlotfeldt commented on SOLR-2690:
------

Being able to specify dates in timezones other than GMT+0 isn't a problem. It would just be nice, but we can ignore that. The time zone the DateMathParser is configured with is the issue (which it sounds like you understand).

My solution, which changes the timezone DateMathParser is constructed with in SimpleFacets to parse start, end and gap, isn't ideal. I went this route because I don't want to run a custom-built Solr -- my solution allowed me to fix the "bug" by simply replacing the "facet" SearchComponent. Affecting all DateMathParsers created for the length of the request is what is really needed (which is what you said).

I like your approach. It sounds like we are on the same page. So, can we get this added? :) Without the time zone affecting DateMathParser, date faceting is useless (at least for 100% of the situations I would use it for).

By the way, I'm glad to see how many responses there have been. I'm happy to see how active this project is. :)
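The rounding half of this discussion can also be illustrated with a plain java.util sketch (again just JDK Calendar code, under the assumption that a date-math expression like NOW/DAY means "floor to the start of the day in some zone"): flooring the same instant to day-start in UTC versus America/Chicago yields instants five hours apart during US daylight saving.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class Main {
    // Floor an instant to the start of its day in the given zone --
    // roughly what a NOW/DAY rounding would mean under that zone.
    public static long floorToDay(long instant, String tzId) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone(tzId));
        cal.setTimeInMillis(instant);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        // 2011-08-03T12:00:00Z
        Calendar utc = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        utc.clear();
        utc.set(2011, Calendar.AUGUST, 3, 12, 0, 0);
        long now = utc.getTimeInMillis();

        long utcDay = floorToDay(now, "UTC");
        long chiDay = floorToDay(now, "America/Chicago");
        // Chicago is UTC-5 (CDT) in August, so its Aug 3 starts 5 hours later
        System.out.println((chiDay - utcDay) / 3600000L); // 5
    }
}
```

This is why a facet range rounded to "midnight" in one zone and a filter query rounded in another would partition documents differently, even for the same nominal expression.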
Re: Did solr.xml persistence brake?
: I opened SOLR-2691 to track and attached a patch.
:
: Would appreciate a quick look from a committer. Thanks!

I'm not too familiar with that code, but i can definitely reproduce the bug ... i'll take a look at the existing tests and see if i can help out with your patch.

-Hoss
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078530#comment-13078530 ]

Hoss Man commented on SOLR-2689:
------

Hmmm, ok ... it looks like maybe this bug was spurred on by a recent mailing list thread about score filtering, where someone referred to this even older thread with a msg from Yonik... http://search-lucene.com/m/4AHNF17wIJW1/

...based on his wording ("frange could possible help ... perhaps something like..."), i don't think yonik really thought that answer through very hard, so it shouldn't be taken as gospel that he was advocating that solution would work (even though strictly speaking it does filter by score), let alone "will work and will still give you meaningful scores that you can sort on".

If you want to filter by arbitrary score (and i won't bother to list all the reasons i think that is a bad idea) and still get those scores back and be able to sort on them, then you still need the "q" to be a query that produces scores, and leave the filtering to an "fq"...

{code}
?q=ipod&fl=*,score&fq={!frange+l=0.72}query($q)
{code}

> !frange with query($qq) sets score=1.0f for all returned documents
> ---
>
> Key: SOLR-2689
> URL: https://issues.apache.org/jira/browse/SOLR-2689
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.4
> Reporter: Markus Jelsma
> Fix For: 3.4, 4.0
>
> Consider the following queries; both query the default field for 'test' and return the document digest and score (i don't seem to be able to get only score; fl=score returns all fields):
> This is a normal query and yields normal results with proper scores:
> {code}
> q=test&fl=digest,score
> {code}
> {code}
> 4.952673
> c48e784f06a051d89f20b72194b0dcf0
> 4.952673
> 7f78a504b8cbd86c6cdbf2aa2c4ae5e3
> 4.952673
> 0f7fefa6586ceda42fc1f095d460aa17
> {code}
> This query uses frange with query() to limit the number of returned documents.
When using multiple search terms i can indeed cut off the result
> set, but in the end all returned documents have score=1.0f. The final result
> set cannot be sorted by score anymore. The result set seems to be returned in
> the order of Lucene docIds.
> {code}
> q={!frange l=1.23}query($qq)&qq=test&fl=digest,score
> {code}
> {code}
> 1.0
> c48e784f06a051d89f20b72194b0dcf0
> 1.0
> 7f78a504b8cbd86c6cdbf2aa2c4ae5e3
> 1.0
> 0f7fefa6586ceda42fc1f095d460aa17
> {code}
Re: Did solr.xml persistence brake?
On 8/2/2011 5:42 PM, Yury Kats wrote:
> It used to work fine with a trunk build from a couple of months ago, so it looks like
> something broke solr.xml persistence.

Can it be related to SOLR-2331? Looking at the code, it does seem like a regression from SOLR-2331. CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list.

I opened SOLR-2691 to track and attached a patch. Would appreciate a quick look from a committer. Thanks!
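The bug pattern described above is easy to reproduce in isolation (an illustrative sketch, not the actual CoreContainer code): when one mutable map is allocated outside the loop and reused, every element of the resulting list aliases the same object, so only the last iteration's values survive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Main {
    // Buggy pattern: a single map allocated outside the loop is shared
    // by every list entry, so all entries reflect the last core only.
    public static List<Map<String, String>> buildShared(String[] cores) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        Map<String, String> attribs = new HashMap<String, String>(); // shared -- the bug
        for (String core : cores) {
            attribs.put("name", core);
            out.add(attribs);
        }
        return out;
    }

    // Fixed pattern: allocate a fresh map per core, inside the loop.
    public static List<Map<String, String>> buildPerCore(String[] cores) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        for (String core : cores) {
            Map<String, String> attribs = new HashMap<String, String>();
            attribs.put("name", core);
            out.add(attribs);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] cores = {"core0", "core1"};
        System.out.println(buildShared(cores).get(0).get("name"));  // core1 -- wrong
        System.out.println(buildPerCore(cores).get(0).get("name")); // core0 -- correct
    }
}
```

This mirrors why the persisted solr.xml ended up with the last core repeated once per core entry: the serializer saw the same attribute map N times.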
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yury Kats updated SOLR-2691:
------
Attachment: jira2691.patch

Patch. Create map of attributes inside the loop.
[jira] [Commented] (SOLR-2331) Refactor CoreContainer's SolrXML serialization code and improve testing
[ https://issues.apache.org/jira/browse/SOLR-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078526#comment-13078526 ]

Yury Kats commented on SOLR-2331:
------

Looks like this introduced a regression in solr.xml persistence. See SOLR-2691.

> Refactor CoreContainer's SolrXML serialization code and improve testing
> ---
>
> Key: SOLR-2331
> URL: https://issues.apache.org/jira/browse/SOLR-2331
> Project: Solr
> Issue Type: Improvement
> Components: multicore
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331.patch
>
> CoreContainer has enough code in it - I'd like to factor out the solr.xml serialization code into SolrXMLSerializer or something - which should make testing it much easier and lightweight.
[jira] [Created] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
solr.xml persistence is broken for multicore (by SOLR-2331)
---

Key: SOLR-2691
URL: https://issues.apache.org/jira/browse/SOLR-2691
Project: Solr
Issue Type: Bug
Components: multicore
Affects Versions: 4.0
Reporter: Yury Kats
Priority: Critical

With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence.

It appears to have been introduced by SOLR-2331: CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list.

I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun

I'm starting Solr with four cores listed in solr.xml:

I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST

And the solr.xml turns into:
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078522#comment-13078522 ]

Hoss Man commented on SOLR-2689:
------

I don't really understand why this is a bug? "frange" is the FunctionRangeQParserPlugin which produces ConstantScoreRangeQueries -- it doesn't matter when/how/why it's used (or that the function it's wrapping comes from an arbitrary query), it always produces range queries that generate constant scores.
[jira] [Updated] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-2690: --- Issue Type: Improvement (was: Bug) bq. If I want midnight in Central time zone I shouldn't have to write: 2011-01-01T06:00:00Z (Note I wrote 6:00 not 0:00) "Central time zone" is a vague concept that may mean one thing to you, but may mean something different to someone else. For any arbitrary moment in the (one dimensional) space of time values, there are an infinite number of ways to represent that time as a string (or as number) depending on where you place your origin for the coordinate system. Requiring that clients format times in UTC is no different then requiring clients to use Arabic numerals to represent integers -- it's just a matter of making sure there is no ambiguity, and everyone is using the same definition of "0". UTC is a completely unambiguous coordinate system for times, that is guaranteed to work in any JVM that Solr might run on. Even if we added code to allow dates to be expressed in arbitrary user selected timezones, we couldn't make that garuntee. Bottom line: the issue of parsing/formatting times in other coordinate systems (ie: timezones) should not be convoluted with the issue of what timezone is used by the DateMathParser when rounding -- those are distinct issues. It's completely conceivable to have a QParser that accepts a variety of data formats and "guesses" what TZ is meant and use that QParser in the same request where you want date faceting based on a TZ that is specified distinctly from the query string (ie: user's local TZ is UTC-0700, but they are searching for records dated before "Dec 15, 2010 4:20PM EST") bq. So one possible alternative that needs more thought is a "TZ" request parameter that would apply by default to things that are date related. Right ... 
From the beginning, DateMathParser was designed with the hope that a TZ/Locale pair could be specified per request (or per field declaration) for driving the rounding/math logic; there was just no sane way to specify an alternative to UTC/US that could be passed down into the DateMathParser and used ubiquitously in a request, because of the FieldType API. (Slight digression... bq. its really only essential that we can affect DateMathParser the SimpleFacets uses when dealing with the gap of the date facets. ...just changing the TZ used by that instance of DateMathParser for rounding/math isn't going to do any good if the user then tries to filter on one of those constraints and the filter query code winds up using the defaults in DateField (ie: NOW/DAY and NOW/DAY+1HOUR are going to be very different things in the facet count code path vs the filter query code path)) Now that we have SolrRequestInfo and a request param to specify the meaning of "NOW", the same logic could be used to allow a request param to specify the TZ/Locale properties of the DateMathParser as well. But like I said: this should really only be used to affect the *math* in DateMathParser -- it should not be used in DateField.parseDate/formatDate, because DateField by definition deals with a single canonical time format; by the time the DateField class is involved in dealing with a Date, everything should be unambiguously expressible in UTC. Logic for parsing date strings that aren't in the canonical date format should be a QParser responsibility at query time, or an UpdateProcessor responsibility at index time. Logic for formatting dates in non-canonical format should be a ResponseWriter responsibility. This new request property we're talking about for defining the "user's TZ" can certainly be used in all of these places to pick/override defaults, but that type of logic really doesn't belong in DateField. > Date Faceting or Range Faceting doesn't take timezone into account. 
> --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Improvement >Affects Versions: 3.3 >Reporter: David Schlotfeldt > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone, daylight saving time changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > where DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could
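Hoss's "coordinate system" point can be demonstrated with a short JDK-only sketch (an illustration, not Solr code): the wall-clock string "midnight, Jan 1 2011" interpreted in US Central time and the canonical UTC string 2011-01-01T06:00:00Z name the same instant.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class UtcCoordinateDemo {
    public static void main(String[] args) throws Exception {
        // Parse the wall-clock string "midnight, Jan 1 2011" as US Central time.
        SimpleDateFormat central = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
        central.setTimeZone(TimeZone.getTimeZone("America/Chicago"));
        Date midnightCentral = central.parse("2011-01-01T00:00:00");

        // Format the same instant in UTC, Solr's canonical representation.
        // Chicago is UTC-6 in January (CST), hence the 6-hour shift.
        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(utc.format(midnightCentral)); // prints 2011-01-01T06:00:00Z
    }
}
```

Both strings are just different renderings of one point on the time line, which is why requiring clients to submit the UTC form loses no information.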
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078467#comment-13078467 ] Markus Jelsma commented on SOLR-2689: - You are right; it's because both examples use one search term and thus all documents have the same score. The problem shows up when you use multiple terms, so that not all scores are identical. I'll provide a better description and example next week when I get back. > !frange with query($qq) sets score=1.0f for all returned documents > -- > > Key: SOLR-2689 > URL: https://issues.apache.org/jira/browse/SOLR-2689 > Project: Solr > Issue Type: Bug > Components: search >Affects Versions: 3.4 >Reporter: Markus Jelsma > Fix For: 3.4, 4.0 > > > Consider the following queries; both query the default field for 'test' and > return the document digest and score (I don't seem to be able to get only the score; > fl=score returns all fields): > This is a normal query and yields normal results with proper scores: > {code} > q=test&fl=digest,score > {code} > {code} > > 4.952673 > c48e784f06a051d89f20b72194b0dcf0 > > 4.952673 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 4.952673 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} > This query uses frange with query() to limit the number of returned > documents. When using multiple search terms I can indeed cut off the result > set, but in the end all returned documents have score=1.0f. The final result > set cannot be sorted by score anymore. The result set seems to be returned in > the order of Lucene docIds. > {code} > q={!frange l=1.23}query($qq)&qq=test&fl=digest,score > {code} > {code} > > 1.0 > c48e784f06a051d89f20b72194b0dcf0 > > 1.0 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 1.0 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} -- This message is automatically generated by JIRA. 
[jira] [Assigned] (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-1692: - Assignee: (was: Grant Ingersoll) > CarrotClusteringEngine produce summary does nothing > --- > > Key: SOLR-1692 > URL: https://issues.apache.org/jira/browse/SOLR-1692 > Project: Solr > Issue Type: Bug > Components: contrib - Clustering >Reporter: Grant Ingersoll > Fix For: 3.4, 4.0 > > Attachments: SOLR-1692.patch > > > In the CarrotClusteringEngine, the produceSummary option does nothing, as the > results of doing the highlighting are just ignored. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Did solr.xml persistence break?
Hi, With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. Can it be related to SOLR-2331? I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun I'm starting Solr with four cores listed in solr.xml: I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST And the solr.xml turns into: - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078451#comment-13078451 ] Otis Gospodnetic commented on SOLR-2689: Markus - I can't even tell that this frange call cuts off any of the hits - you have numFound="227763" in both examples above. Am I missing something? :) > !frange with query($qq) sets score=1.0f for all returned documents > -- > > Key: SOLR-2689 > URL: https://issues.apache.org/jira/browse/SOLR-2689 > Project: Solr > Issue Type: Bug > Components: search >Affects Versions: 3.4 >Reporter: Markus Jelsma > Fix For: 3.4, 4.0 > > > Consider the following queries; both query the default field for 'test' and > return the document digest and score (I don't seem to be able to get only the score; > fl=score returns all fields): > This is a normal query and yields normal results with proper scores: > {code} > q=test&fl=digest,score > {code} > {code} > > 4.952673 > c48e784f06a051d89f20b72194b0dcf0 > > 4.952673 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 4.952673 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} > This query uses frange with query() to limit the number of returned > documents. When using multiple search terms I can indeed cut off the result > set, but in the end all returned documents have score=1.0f. The final result > set cannot be sorted by score anymore. The result set seems to be returned in > the order of Lucene docIds. > {code} > q={!frange l=1.23}query($qq)&qq=test&fl=digest,score > {code} > {code} > > 1.0 > c48e784f06a051d89f20b72194b0dcf0 > > 1.0 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 1.0 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078434#comment-13078434 ] Grant Ingersoll commented on LUCENE-2748: - I wonder if the best thing to do here is to simply start fresh and clean, leave all existing content up as-is, and link to it as the "old" content. > Convert all Lucene web properties to use the ASF CMS > > > Key: LUCENE-2748 > URL: https://issues.apache.org/jira/browse/LUCENE-2748 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > > The new CMS has a lot of nice features (and some kinks to still work out) and > Forrest just doesn't cut it anymore, so we should move to the ASF CMS: > http://apache.org/dev/cms.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-2143) Add OpenSearch resources to the bundled example
[ https://issues.apache.org/jira/browse/SOLR-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-2143: - Assignee: (was: Grant Ingersoll) > Add OpenSearch resources to the bundled example > > > Key: SOLR-2143 > URL: https://issues.apache.org/jira/browse/SOLR-2143 > Project: Solr > Issue Type: Wish > Components: documentation >Affects Versions: 4.0 > Environment: N/A >Reporter: Rich Cariens > Fix For: 4.0 > > Attachments: SOLR-2143.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > Guidance & samples on how to make a Solr instance OpenSearch-compliant feels > like it would add value to the user community. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078419#comment-13078419 ] David Schlotfeldt commented on SOLR-2690: - Okay, I've modified my code to now take "facet.date.tz" instead. The time zone now affects the facet's start, end and gap values. > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David Schlotfeldt > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone, daylight saving time changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > where DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone), it's really only > essential that we can affect the DateMathParser that SimpleFacets uses when > dealing with the gap of the date facets. > Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in, it sets the timezone on the Calendar it is using. 
> Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more than happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved, and a decision needs > to be made on the syntax used. > Thanks! > David -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
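The issue's central claim - that a +1MONTH gap has a different length depending on the timezone - can be checked with a small JDK-only sketch (this is plain Calendar arithmetic, not the actual DateMathParser code): month arithmetic in America/Chicago crosses the March 2011 DST transition, so that "month" is one hour shorter than the same month computed in UTC.

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateGapDemo {
    // Length in milliseconds of "March 1, local midnight, plus 1 month"
    // in the given timezone -- the kind of math a +1MONTH facet gap does.
    static long monthMillis(String tzId) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(tzId), Locale.US);
        cal.clear();
        cal.set(2011, Calendar.MARCH, 1, 0, 0, 0); // local midnight, March 1 2011
        long start = cal.getTimeInMillis();
        cal.add(Calendar.MONTH, 1);                // wall-clock "+1MONTH"
        return cal.getTimeInMillis() - start;
    }

    public static void main(String[] args) {
        long utcHours = monthMillis("UTC") / 3600000L;
        // US DST began March 13, 2011, so March in Chicago is an hour shorter.
        long chicagoHours = monthMillis("America/Chicago") / 3600000L;
        System.out.println(utcHours + " vs " + chicagoHours); // prints 744 vs 743
    }
}
```

A facet gap computed in UTC therefore does not line up with local-calendar month boundaries, which is exactly why a per-request timezone for the *math* matters.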
[Lucene.Net] [jira] [Resolved] (LUCENENET-404) Improve brand logo design
[ https://issues.apache.org/jira/browse/LUCENENET-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Troy Howard resolved LUCENENET-404. --- Resolution: Fixed Uploaded the artifacts in r1153264 > Improve brand logo design > - > > Key: LUCENENET-404 > URL: https://issues.apache.org/jira/browse/LUCENENET-404 > Project: Lucene.Net > Issue Type: Sub-task > Components: Project Infrastructure >Reporter: Troy Howard >Assignee: Troy Howard >Priority: Minor > Labels: branding, logo > Attachments: lucene-alternates.jpg, lucene-medium.png, > lucene-net-logo-display.jpg > > > The existing Lucene.Net logo leaves a lot to be desired. We'd like a new logo > that is modern and well designed. > To implement this, Troy is coordinating with StackOverflow/StackExchange to > manage a logo design contest, the results of which will be our new logo > design. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch EasySimilarity now computes norms in the same way as DefaultSimilarity. Actually not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right), what do you reckon? I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices. > Implement various ranking models as Similarities > > > Key: LUCENE-3220 > URL: https://issues.apache.org/jira/browse/LUCENE-3220 > Project: Lucene - Java > Issue Type: Sub-task > Components: core/query/scoring, core/search >Affects Versions: flexscoring branch >Reporter: David Mark Nemeskey >Assignee: David Mark Nemeskey > Labels: gsoc, gsoc2011 > Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we > can finally work on implementing the standard ranking models. Currently DFR, > BM25 and LM are on the menu. > Done: > * {{EasyStats}}: contains all statistics that might be relevant for a > ranking algorithm > * {{EasySimilarity}}: the ancestor of all the other similarities. 
Hides the > DocScorers and as much implementation detail as possible > * _BM25_: the current "mock" implementation might be OK > * _LM_ > * _DFR_ > * The so-called _Information-Based Models_ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078400#comment-13078400 ] Simon Willnauer commented on LUCENE-3030: - bq. These are huge speedups for the terms-dict intensive queries! oh boy! This is awesome! > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). 
> In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on its leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
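The variable-size, prefix-based blocking idea can be sketched in toy form (an illustration of the concept only, not Lucene's BlockTreeTermsWriter): recursively extend the shared prefix one character at a time until each block of sorted terms fits under a maximum size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBlocksDemo {
    // Split a sorted, duplicate-free term range into blocks of at most
    // maxBlock terms by lengthening the shared prefix -- a toy version of
    // variable-size blocking, ignoring terms-on-inner-blocks and min sizes.
    static void blocks(List<String> sorted, int prefixLen, int maxBlock,
                       List<List<String>> out) {
        if (sorted.size() <= maxBlock) {
            out.add(sorted);
            return;
        }
        // Group by the next-longer prefix; TreeMap keeps term order.
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String t : sorted) {
            String p = t.substring(0, Math.min(prefixLen + 1, t.length()));
            byPrefix.computeIfAbsent(p, k -> new ArrayList<>()).add(t);
        }
        for (List<String> group : byPrefix.values()) {
            blocks(group, prefixLen + 1, maxBlock, out);
        }
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("car", "card", "care", "cat", "dog", "dot");
        List<List<String>> out = new ArrayList<>();
        blocks(terms, 0, 3, out);
        System.out.println(out); // prints [[car, card, care], [cat], [dog, dot]]
    }
}
```

The resulting block boundaries follow prefix structure rather than a fixed every-N-terms rule, which is what lets the index act as a true prefix trie and fast-fail lookups.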
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078384#comment-13078384 ] James Dyer commented on SOLR-2382: -- Lance, I do not have any scientific benchmarks, but I can tell you how we use BerkleyBackedCache and how it performs for us. In our main app, we fully re-index all our data every night (13+ million records). It's basically a 2-step process. First we run ~50 DIH handlers, each of which builds a cache from databases, flat files, etc. The caches partition the data 8 ways. Then a "master" DIH script does all the joining, runs transformers on the data, etc. We have all 8 invocations of this same "master" DIH config running simultaneously, indexing to the same Solr core, so each DIH invocation is processing 1.6 million records directly out of caches, doing all the 1-many joins, running transformer code, indexing, etc. This takes 1-1/2 hours, so maybe 250-300 Solr records get added per second. We're using fast local disks configured with RAID-0 on an 8-core 64GB server. This app is running Solr 1.4, using the original version of this patch, prior to my front-porting it to trunk. No doubt some of the time is spent contending for the Lucene index as all 8 DIH invocations are indexing at the same time. We also have another app that uses Solr 4.0 with the patch I originally posted back in February, sharing hardware with the main app. This one has about 10 entities and uses a simple 1-dih-handler configuration. The parent entity drives directly off the database while all the child entities use SqlEntityProcessor with BerkleyBackedCache. There are only 25,000 fairly narrow records and we can re-index everything in about 10 minutes. This includes database time, indexing, running transformers, etc., in addition to the cache overhead. The inspiration for this was that we were converting off of Endeca and we were relying on Endeca's "Forge" program to join & denormalize all of the data. 
Forge has a very fast disk-backed caching mechanism and I needed to match that performance with DIH. I'm pretty sure what we have here surpasses Forge. And we also get a big bonus in that it lets you persist caches and use them as a subsequent input. With Forge, we had to output the data into huge delimited text files and then use that as input for the next step... Hope this information gives you some idea if this will work for your use case. > DIH Cache Improvements > -- > > Key: SOLR-2382 > URL: https://issues.apache.org/jira/browse/SOLR-2382 > Project: Solr > Issue Type: New Feature > Components: contrib - DataImportHandler >Reporter: James Dyer >Priority: Minor > Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, > SOLR-2382-entities.patch, SOLR-2382-entities.patch, > SOLR-2382-properties.patch, SOLR-2382-properties.patch, > SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, > SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch > > > Functionality: > 1. Provide a pluggable caching framework for DIH so that users can choose a > cache implementation that best suits their data and application. > > 2. Provide a means to temporarily cache a child Entity's data without > needing to create a special cached implementation of the Entity Processor > (such as CachedSqlEntityProcessor). > > 3. Provide a means to write the final (root entity) DIH output to a cache > rather than to Solr. Then provide a way for a subsequent DIH call to use the > cache as an Entity input. Also provide the ability to do delta updates on > such persistent caches. > > 4. Provide the ability to partition data across multiple caches that can > then be fed back into DIH and indexed either to varying Solr Shards, or to > the same Core in parallel. > Use Cases: > 1. 
We needed a flexible & scalable way to temporarily cache child-entity > data prior to joining to parent entities. > - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" > problem. > - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching > mechanism and does not scale. > - There is no way to cache non-SQL inputs (ex: flat files, xml, etc). > > 2. We needed the ability to gather data from long-running entities by a > process that runs separate from our main indexing process. > > 3. We wanted the ability to do a delta import of only the entities that > changed. > - Lucene/Solr requires entire documents to be re-indexed, even if only a > few fields changed. > - Our data comes from 50+ complex sql queries and/or flat files. > - We do not want to incur overhea
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 125 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/125/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:198) Build Log (for compile errors): [...truncated 11154 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3343) Comparison operators >,>=,<,<= and = support as RangeQuery syntax in QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078372#comment-13078372 ] Adriano Crestani commented on LUCENE-3343: -- Hi Oliver, I was only able to make your patch work when I merged it with LUCENE-3338; however, LUCENE-3338 is only available for trunk, not 3x. I will need to figure out some other way to make it work on 3x. I plan to work on this soon, probably next weekend. > Comparison operators >,>=,<,<= and = support as RangeQuery syntax in > QueryParser > > > Key: LUCENE-3343 > URL: https://issues.apache.org/jira/browse/LUCENE-3343 > Project: Lucene - Java > Issue Type: New Feature > Components: modules/queryparser >Reporter: Olivier Favre >Assignee: Adriano Crestani >Priority: Minor > Labels: parser, query > Fix For: 3.4, 4.0 > > Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch > > Original Estimate: 96h > Remaining Estimate: 96h > > To offer better interoperability with other search engines and to provide an > easier and more straightforward syntax, > the operators >, >=, <, <= and = should be available to express an open range > query. > They should at least work for numeric queries. > '=' can be made a synonym for ':'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
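The proposed mapping from comparison operators to range-query syntax could look roughly like this (a hypothetical sketch, not the attached patch; it assumes the classic query syntax where [] endpoints are inclusive and {} are exclusive, with mixed brackets for half-open ranges):

```java
public class ComparisonRewriteDemo {
    // Hypothetical rewrite of "field OP value" into classic range syntax.
    static String rewrite(String field, String op, String value) {
        switch (op) {
            case ">":  return field + ":{" + value + " TO *]";  // exclusive lower bound
            case ">=": return field + ":[" + value + " TO *]";  // inclusive lower bound
            case "<":  return field + ":[* TO " + value + "}";  // exclusive upper bound
            case "<=": return field + ":[* TO " + value + "]";  // inclusive upper bound
            case "=":  return field + ":" + value;              // '=' as a synonym for ':'
            default:   throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("price", ">=", "10")); // prints price:[10 TO *]
    }
}
```

Whatever syntax the parser finally adopts, the rewrite target would be an ordinary (numeric) range query, so scoring and rewriting behave exactly as for hand-written ranges.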
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078354#comment-13078354 ] Lance Norskog commented on SOLR-2382: - Hello- Are there any benchmark results with this patch? Given, say two tables with a million elements in a pairwise join, how well does this caching system work? > DIH Cache Improvements > -- > > Key: SOLR-2382 > URL: https://issues.apache.org/jira/browse/SOLR-2382 > Project: Solr > Issue Type: New Feature > Components: contrib - DataImportHandler >Reporter: James Dyer >Priority: Minor > Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, > SOLR-2382-entities.patch, SOLR-2382-entities.patch, > SOLR-2382-properties.patch, SOLR-2382-properties.patch, > SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, > SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch > > > Functionality: > 1. Provide a pluggable caching framework for DIH so that users can choose a > cache implementation that best suits their data and application. > > 2. Provide a means to temporarily cache a child Entity's data without > needing to create a special cached implementation of the Entity Processor > (such as CachedSqlEntityProcessor). > > 3. Provide a means to write the final (root entity) DIH output to a cache > rather than to Solr. Then provide a way for a subsequent DIH call to use the > cache as an Entity input. Also provide the ability to do delta updates on > such persistent caches. > > 4. Provide the ability to partition data across multiple caches that can > then be fed back into DIH and indexed either to varying Solr Shards, or to > the same Core in parallel. > Use Cases: > 1. We needed a flexible & scalable way to temporarily cache child-entity > data prior to joining to parent entities. 
> - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" > problem. > - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching > mechanism and does not scale. > - There is no way to cache non-SQL inputs (ex: flat files, xml, etc). > > 2. We needed the ability to gather data from long-running entities by a > process that runs separate from our main indexing process. > > 3. We wanted the ability to do a delta import of only the entities that > changed. > - Lucene/Solr requires entire documents to be re-indexed, even if only a > few fields changed. > - Our data comes from 50+ complex sql queries and/or flat files. > - We do not want to incur overhead re-gathering all of this data if only 1 > entity's data changed. > - Persistent DIH caches solve this problem. > > 4. We want the ability to index several documents in parallel (using 1.4.1, > which did not have the "threads" parameter). > > 5. In the future, we may need to use Shards, creating a need to easily > partition our source data into Shards. > Implementation Details: > 1. De-couple EntityProcessorBase from caching. > - Created a new interface, DIHCache & two implementations: > - SortedMapBackedCache - An in-memory cache, used as default with > CachedSqlEntityProcessor (now deprecated). > - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested > with je-4.1.6.jar >- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar. > I believe this may be incompatible due to Generic Usage. >- NOTE: I did not modify the ant script to automatically get this jar, > so to use or evaluate this patch, download bdb-je from > http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html > > 2. Allow Entity Processors to take a "cacheImpl" parameter to cause the > entity data to be cached (see EntityProcessorBase & DIHCacheProperties). > > 3. 
Partially De-couple SolrWriter from DocBuilder > - Created a new interface DIHWriter, & two implementations: >- SolrWriter (refactored) >- DIHCacheWriter (allows DIH to write ultimately to a Cache). > > 4. Create a new Entity Processor, DIHCacheProcessor, which reads a > persistent Cache as DIH Entity Input. > > 5. Support a "partition" parameter with both DIHCacheWriter and > DIHCacheProcessor to allow for easy partitioning of source entity data. > > 6. Change the semantics of entity.destroy() > - Previously, it was being called on each iteration of > DocBuilder.buildDocument(). > - Now it is does one-time cleanup tasks (like closing or deleting a > disk-backed cache) once the entity processor is completed. > - The only out-of-the-box entity processor that previously implemented > destroy() was LineEntitiyProcessor, so this is no
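The "partition" parameter described above amounts to routing each record key to one of N caches. A minimal sketch of such routing (a hypothetical helper for illustration, not code from the patch):

```java
import java.util.ArrayList;
import java.util.List;

public class CachePartitionDemo {
    // Route a record key to one of numPartitions caches.
    // Mask the sign bit rather than Math.abs (which overflows on MIN_VALUE).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 8; // e.g. the 8-way partitioning described in the comments above
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new ArrayList<>());

        for (String id : new String[] {"rec1", "rec2", "rec3", "rec4"}) {
            partitions.get(partitionFor(id, n)).add(id);
        }

        int total = 0;
        for (List<String> p : partitions) total += p.size();
        System.out.println(total); // prints 4 -- every record landed in exactly one partition
    }
}
```

Because the routing is deterministic, a later DIH invocation (or a separate Solr shard) can read back exactly the partition it owns.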
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078353#comment-13078353 ] Robert Muir commented on SOLR-2688: --- I'll work up a patch, might tweak the example a bit for the time being, I'd like to err on the side of performance. Note: with LUCENE-3030, Mike has really sped this guy up again. > switch solr 4.0 example to DirectSpellChecker > - > > Key: SOLR-2688 > URL: https://issues.apache.org/jira/browse/SOLR-2688 > Project: Solr > Issue Type: Improvement > Components: spellchecker >Affects Versions: 4.0 >Reporter: Robert Muir > > For discussion: we might want to switch the Solr 4.0 example to use > DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078352#comment-13078352 ] David Schlotfeldt commented on SOLR-2690: - By extending FacetComponent (and having to resort to reflection) I added: facet.date.gap.tz The new parameter only affects the gap. The math done when processing the gap is the largest issue when it comes to date faceting, in my mind. I would be more than happy to provide a patch to add this feature. No, this doesn't address all timezone issues, but at least it would address the main issue that makes date faceting, in my eyes, completely useless. I bet there are 100s of people out there using date faceting who don't realize it does NOT give correct results :) > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David Schlotfeldt > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone day-light-savings changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone) its really only > essential that we can affect DateMathParser the SimpleFacets uses when > dealing with the gap of the date facets. 
> Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in sets the timezone on the Calendar it is using. > Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more then happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved and a decision needs > to be made on the syntax used > Thanks! > David
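To make the daylight-savings effect described in the issue concrete: here is a small standalone Java sketch (my own illustration, not code from Solr or any patch) showing that adding one month to the same instant lands on different instants depending on the Calendar's time zone, because US DST begins mid-March:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class MonthGapDemo {
    public static void main(String[] args) {
        SimpleDateFormat utcFmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        utcFmt.setTimeZone(TimeZone.getTimeZone("UTC"));

        // Midnight March 1 2011 in America/Chicago == 2011-03-01T06:00:00Z.
        Calendar central = Calendar.getInstance(TimeZone.getTimeZone("America/Chicago"));
        central.clear();
        central.set(2011, Calendar.MARCH, 1, 0, 0, 0);

        // The same instant, interpreted with a UTC calendar
        // (what always constructing DateMathParser with UTC amounts to).
        Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        utc.clear();
        utc.set(2011, Calendar.MARCH, 1, 6, 0, 0);

        central.add(Calendar.MONTH, 1); // DST began Mar 13, so March is an hour short
        utc.add(Calendar.MONTH, 1);

        System.out.println(utcFmt.format(central.getTime())); // 2011-04-01T05:00:00Z
        System.out.println(utcFmt.format(utc.getTime()));     // 2011-04-01T06:00:00Z
    }
}
```

The two "+1 month" results differ by an hour, which is exactly why facet range edges computed in UTC can put documents into the wrong bucket for a user in another zone.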
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078337#comment-13078337 ] Michael McCandless commented on LUCENE-3030: Here's the graph of the results: !BlockTree.png! > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). 
> In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks.
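As a toy illustration of the variable-sized, prefix-sharing block idea discussed above (my own sketch, not code from the LUCENE-3030 patch): terms that share a prefix are split out into their own block only once enough of them accumulate, while rare prefixes stay in the parent block, which is the burst-trie-like policy in miniature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBlocksDemo {
    // Split a sorted term list into variable-sized blocks keyed by a shared
    // prefix. A prefix gets its own block only once enough terms share it
    // (minBlockSize); rarer terms stay in the parent block, keyed by "".
    static Map<String, List<String>> blocks(List<String> sortedTerms, int prefixLen, int minBlockSize) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String t : sortedTerms) {
            String p = t.length() >= prefixLen ? t.substring(0, prefixLen) : t;
            byPrefix.computeIfAbsent(p, k -> new ArrayList<>()).add(t);
        }
        Map<String, List<String>> result = new TreeMap<>();
        List<String> parent = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byPrefix.entrySet()) {
            if (e.getValue().size() >= minBlockSize) {
                result.put(e.getKey(), e.getValue()); // popular prefix: own block
            } else {
                parent.addAll(e.getValue());          // rare prefix: parent block
            }
        }
        result.put("", parent);
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aa", "ab", "ac", "ba", "zz");
        // "a" is shared by 3 terms -> its own block; "b" and "z" stay in the root.
        System.out.println(blocks(terms, 1, 3)); // {=[ba, zz], a=[aa, ab, ac]}
    }
}
```

A real implementation works over a trie and recurses (sub-blocks of sub-blocks), but the core intuition is the same: block boundaries follow the data's prefix distribution instead of a fixed every-N-terms rule.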
[jira] [Updated] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3030: --- Attachment: BlockTree.png The block tree terms dict seems to be working... all tests pass w/ StandardTree codec. There's still more to do (many nocommits), but I think the perf results should be close to what I finally commit:
||Task||QPS base||StdDev base||QPS blocktree||StdDev blocktree||Pct diff||
|IntNRQ|11.58|1.37|10.11|1.77|{color:red}35%{color}-{color:green}16%{color}|
|Term|106.65|3.24|98.84|4.97|{color:red}14%{color}-{color:green}0%{color}|
|Prefix3|30.83|1.36|28.64|2.42|{color:red}18%{color}-{color:green}5%{color}|
|OrHighHigh|5.85|0.15|5.44|0.28|{color:red}14%{color}-{color:green}0%{color}|
|OrHighMed|19.25|0.62|17.91|0.86|{color:red}14%{color}-{color:green}0%{color}|
|Phrase|9.37|0.42|8.87|0.10|{color:red}10%{color}-{color:green}0%{color}|
|TermBGroup1M|44.02|0.90|42.76|1.08|{color:red}7%{color}-{color:green}1%{color}|
|TermGroup1M|37.68|0.65|36.95|0.74|{color:red}5%{color}-{color:green}1%{color}|
|TermBGroup1M1P|47.16|2.94|46.36|0.16|{color:red}7%{color}-{color:green}5%{color}|
|SpanNear|5.60|0.35|5.55|0.29|{color:red}11%{color}-{color:green}11%{color}|
|SloppyPhrase|3.36|0.16|3.34|0.04|{color:red}6%{color}-{color:green}5%{color}|
|Wildcard|35.15|1.30|35.05|2.42|{color:red}10%{color}-{color:green}10%{color}|
|AndHighHigh|10.71|0.22|10.99|0.22|{color:red}1%{color}-{color:green}6%{color}|
|AndHighMed|51.15|1.44|54.31|1.84|{color:green}0%{color}-{color:green}12%{color}|
|Fuzzy1|31.63|0.55|66.15|1.35|{color:green}101%{color}-{color:green}117%{color}|
|PKLookup|40.00|0.75|84.93|5.49|{color:green}94%{color}-{color:green}130%{color}|
|Fuzzy2|33.78|0.82|89.59|2.46|{color:green}151%{color}-{color:green}179%{color}|
|Respell|23.56|1.15|70.89|1.77|{color:green}179%{color}-{color:green}224%{color}|
This is for a multi-segment index, 10 M wikipedia docs, using luceneutil. 
These are huge speedups for the terms-dict intensive queries! The two FuzzyQuerys and Respell get the speedup from the directly implemented intersect method, and PKLookup gets gains because it can often avoid seeking, since block tree's terms index can sometimes rule out terms by their prefix (though this relies on the PK terms being "predictable" -- I use "%09d" w/ a counter, now; if you instead used something more random looking (GUIDs) I don't think we'd see gains). > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. 
If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] >
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 124 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/124/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)
at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:639)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:99)
at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:174)
Build Log (for compile errors): [...truncated 11177 lines...]
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078292#comment-13078292 ] David commented on SOLR-2690: - Good point. Also, this isn't a bug, but if we want a complete solution we really need a way to specify times in other timezones. If I want midnight in the Central time zone I shouldn't have to write: 2011-01-01T06:00:00Z (Note I wrote 6:00, not 0:00.) I believe only DateField would have to be modified to make it possible to specify a timezone. For a complete example, if I wanted to facet blog posts by the date posted and by month: facet.date=blogPostDate facet.date.start=2011-01-01T00:00:00 facet.date.end=2012-01-01T00:00:00 facet.date.gap=+1MONTH timezone=America/Chicago Currently you would need to do the following. (This actually gives close-to-correct results, but not exact. Again, the problem is that the gap of +1MONTH doesn't take daylight savings into account, so blog posts on the edge of ranges are counted in the wrong range.) facet.date=blogPostDate facet.date.start=2011-01-01T06:00:00Z facet.date.end=2012-01-01T06:00:00Z facet.date.gap=+1MONTH > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone day-light-savings changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > DateMathParser is created. All three are configured with the timezone being > UTC. 
If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone) its really only > essential that we can affect DateMathParser the SimpleFacets uses when > dealing with the gap of the date facets. > Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in sets the timezone on the Calendar it is using. > Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more then happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved and a decision needs > to be made on the syntax used > Thanks! > David
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076307#comment-13076307 ] Yonik Seeley commented on SOLR-2690: Although this probably isn't a "bug", I agree that handling timezones somehow would be nice. We just need to think very carefully about the API so we can support it long term. One immediate thought I had was that it would be a pain to specify the timezone everywhere. Even a simple range query would need to specify it twice: my_date:["(?timeZone=America/Chicago)NOW/YEAR" TO "(?timeZone=America/Chicago)+1MONTH"] So one possible alternative that needs more thought is a "TZ" request parameter that would apply by default to things that are date related. > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone day-light-savings changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone) its really only > essential that we can affect DateMathParser the SimpleFacets uses when > dealing with the gap of the date facets. 
> Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in sets the timezone on the Calendar it is using. > Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more then happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved and a decision needs > to be made on the syntax used > Thanks! > David
[jira] [Commented] (SOLR-2525) Date Faceting or Range Faceting with offset doesn't convert timezone
[ https://issues.apache.org/jira/browse/SOLR-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076303#comment-13076303 ] David commented on SOLR-2525: - I have opened a new ticket about this: SOLR-2690 > Date Faceting or Range Faceting with offset doesn't convert timezone > > > Key: SOLR-2525 > URL: https://issues.apache.org/jira/browse/SOLR-2525 > Project: Solr > Issue Type: Bug > Components: Schema and Analysis, search >Affects Versions: 3.1 > Environment: Solr 3.1 > Windows 2008 RC2 Server > Java 6 > Running on Jetty >Reporter: Rohit Gupta > Labels: date, facet > > I am trying to facet based on date field and apply user timezone offset so > that the faceted results are in user timezone. My faceted result is given > below, > > > > 0 > 6 > > true > icici >name="facet.range.start">2011-05-02T00:00:00Z+330MINUTES > createdOnGMTDate > 2011-05-18T00:00:00Z > +1DAY > > > > > > > 4 > 63 > 0 > 0 > .. > > +1DAY > 2011-05-02T05:30:00Z > 2011-05-18T05:30:00Z > > > > > Now if you notice that the response show 4 records for the 2th of May 2011 > which will fall in the IST timezone (+330MINUTES), but when I try to get the > results I see that there is only 1 result for the 2nd why is this happening. > > > > 0 > 5 > > createdOnGMTDate asc >name="fl">createdOnGMT,createdOnGMTDate,twtText >name="fq">createdOnGMTDate:[2011-05-01T00:00:00Z+330MINUTES TO *] > icici > > > > > Mon, 02 May 2011 16:27:05+ > 2011-05-02T16:27:05Z > #TechStrat615. Infosys (business soln & > IT > outsourcer) manages damages with new chairman > K.Kamath (ex ICICI > Bank chairman) to begin Aug 21. > > > Mon, 02 May 2011 19:00:44+ > 2011-05-02T19:00:44Z > how to get icici mobile banking > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. 
MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 08:52:37+ > 2011-05-03T08:52:37Z > RT @nice4ufan: ICICI BANK PERSONAL LOAN > http://ee4you.blogspot.com/2011/04/icici-bank-personal-loan.html >
[jira] [Created] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David Timezone needs to be taken into account when doing date math. Currently it isn't. DateMathParser instances created are always being constructed with UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved. Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of the date facets. Another solution is to expand the syntax of the expressions DateMathParser understands. For example, we could allow "(?timeZone=VALUE)" to be added anywhere within an expression. VALUE would be the id of the timezone. When DateMathParser reads this in, it sets the timezone on the Calendar it is using. Two examples: - "(?timeZone=America/Chicago)NOW/YEAR" - "(?timeZone=America/Chicago)+1MONTH" I would be more than happy to modify DateMathParser and provide a patch. I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used. Thanks! David
[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile
[ https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076301#comment-13076301 ] Shawn Heisey commented on SOLR-1972: Hoss, the patch isn't my work, I just modified it to support a 100th percentile and reattached it. I am only just now beginning to learn Java. Although I have some clue what you're saying with static methods, actually doing it properly within a larger work like Solr is something I won't be able to do yet. > Need additional query stats in admin interface - median, 95th and 99th > percentile > - > > Key: SOLR-1972 > URL: https://issues.apache.org/jira/browse/SOLR-1972 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Shawn Heisey >Priority: Minor > Attachments: SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, > SOLR-1972.patch, elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, > elyograg-1972-trunk.patch, elyograg-1972-trunk.patch > > > I would like to see more detailed query statistics from the admin GUI. This > is what you can get now: > requests : 809 > errors : 0 > timeouts : 0 > totalTime : 70053 > avgTimePerRequest : 86.59209 > avgRequestsPerSecond : 0.8148785 > I'd like to see more data on the time per request - median, 95th percentile, > 99th percentile, and any other statistical function that makes sense to > include. In my environment, the first bunch of queries after startup tend to > take several seconds each. I find that the average value tends to be useless > until it has several thousand queries under its belt and the caches are > thoroughly warmed. The statistical functions I have mentioned would quickly > eliminate the influence of those initial slow queries. > The system will have to store individual data about each query. I don't know > if this is something Solr does already. 
It would be nice to have a > configurable count of how many of the most recent data points are kept, to > control the amount of memory the feature uses. The default value could be > something like 1024 or 4096.
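The feature sketched in the issue -- keep only the most recent N request times and compute median/95th/99th percentiles over them -- can be illustrated with a small ring-buffer sketch. This is my own hypothetical illustration (class and method names are assumptions), not code from any SOLR-1972 patch:

```java
import java.util.Arrays;

// Hypothetical sketch: retain only the most recent N request times in a
// fixed-size ring buffer (bounding memory use, as the issue suggests) and
// compute nearest-rank percentiles on demand.
public class QueryTimeStats {
    private final long[] ring;
    private int count = 0, next = 0;

    QueryTimeStats(int capacity) {
        ring = new long[capacity];
    }

    void record(long millis) {
        ring[next] = millis;
        next = (next + 1) % ring.length;
        if (count < ring.length) count++;
    }

    // Nearest-rank percentile over the retained samples, p in (0, 100].
    long percentile(double p) {
        long[] sorted = Arrays.copyOf(ring, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * count); // 1-based nearest rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        QueryTimeStats stats = new QueryTimeStats(1024); // the suggested default size
        for (long t = 1; t <= 100; t++) stats.record(t); // 100 requests: 1..100 ms
        System.out.println(stats.percentile(50)); // 50
        System.out.println(stats.percentile(95)); // 95
        System.out.println(stats.percentile(99)); // 99
    }
}
```

Because old samples are overwritten once the buffer wraps, the slow warm-up queries mentioned in the issue age out of the statistics automatically, unlike a running average over all requests.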
[jira] [Commented] (SOLR-2525) Date Faceting or Range Faceting with offset doesn't convert timezone
[ https://issues.apache.org/jira/browse/SOLR-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076294#comment-13076294 ] David commented on SOLR-2525: - Timezone needs to be taken into account when doing date math. Currently it isn't. DateMathParser instances created are told to use UTC. This is a huge issue when it comes to faceting. Depending on your timezone day-light-savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code DateMathParser created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser this faceting issue would be resolved. > Date Faceting or Range Faceting with offset doesn't convert timezone > > > Key: SOLR-2525 > URL: https://issues.apache.org/jira/browse/SOLR-2525 > Project: Solr > Issue Type: Bug > Components: Schema and Analysis, search >Affects Versions: 3.1 > Environment: Solr 3.1 > Windows 2008 RC2 Server > Java 6 > Running on Jetty >Reporter: Rohit Gupta > Labels: date, facet > > I am trying to facet based on date field and apply user timezone offset so > that the faceted results are in user timezone. My faceted result is given > below, > > > > 0 > 6 > > true > icici >name="facet.range.start">2011-05-02T00:00:00Z+330MINUTES > createdOnGMTDate > 2011-05-18T00:00:00Z > +1DAY > > > > > > > 4 > 63 > 0 > 0 > .. > > +1DAY > 2011-05-02T05:30:00Z > 2011-05-18T05:30:00Z > > > > > Now if you notice that the response show 4 records for the 2th of May 2011 > which will fall in the IST timezone (+330MINUTES), but when I try to get the > results I see that there is only 1 result for the 2nd why is this happening. 
> > > > 0 > 5 > > createdOnGMTDate asc >name="fl">createdOnGMT,createdOnGMTDate,twtText >name="fq">createdOnGMTDate:[2011-05-01T00:00:00Z+330MINUTES TO *] > icici > > > > > Mon, 02 May 2011 16:27:05+ > 2011-05-02T16:27:05Z > #TechStrat615. Infosys (business soln & > IT > outsourcer) manages damages with new chairman > K.Kamath (ex ICICI > Bank chairman) to begin Aug 21. > > > Mon, 02 May 2011 19:00:44+ > 2011-05-02T19:00:44Z > how to get icici mobile banking > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 08:52:37+ > 2011-05-03T08:52:37Z > RT @nice4ufan: ICICI BANK PERSONAL LOAN > http://ee4you.blogspot.com/2011/04/icici-bank-personal-loan.html >
[jira] [Resolved] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Hatcher resolved SOLR-1032. Resolution: Fixed Fix Version/s: 4.0 Assignee: Erik Hatcher > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Assignee: Erik Hatcher >Priority: Minor > Fix For: 4.0 > > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV.
[jira] [Created] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
!frange with query($qq) sets score=1.0f for all returned documents -- Key: SOLR-2689 URL: https://issues.apache.org/jira/browse/SOLR-2689 Project: Solr Issue Type: Bug Components: search Affects Versions: 3.4 Reporter: Markus Jelsma Fix For: 3.4, 4.0 Consider the following queries; both query the default field for 'test' and return the document digest and score (I don't seem to be able to get only the score; fl=score returns all fields). This is a normal query and yields normal results with proper scores: {code} q=test&fl=digest,score {code} {code} − 4.952673 c48e784f06a051d89f20b72194b0dcf0 − 4.952673 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 − 4.952673 0f7fefa6586ceda42fc1f095d460aa17 {code} This query uses frange with query() to limit the number of returned documents. When using multiple search terms I can indeed cut off the result set, but in the end all returned documents have score=1.0f. The final result set cannot be sorted by score anymore. The result set seems to be returned in the order of Lucene docIds. {code} q={!frange l=1.23}query($qq)&qq=test&fl=digest,score {code} {code} − 1.0 c48e784f06a051d89f20b72194b0dcf0 − 1.0 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 − 1.0 0f7fefa6586ceda42fc1f095d460aa17 {code}
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076259#comment-13076259 ] Jan Høydahl commented on SOLR-1979: --- This has been tested on a real dataset of several hundred thousand docs, including HTML, office docs and multiple other formats, and it works well. I'd like some more pairs of eyes on this, however. One thing which is less than perfect is that the threshold conversion from Tika currently parses the (internal) distance value out of a String, for lack of a getDistance() method (TIKA-568). This is a bit of a hack, but I argue it's a beneficial one, since we can now configure langid.threshold to something meaningful for our own data instead of the preset binary isReasonablyCertain(). As we also normalize to a value between 0-1, we abstract away the Tika implementation detail, and are free to use any improved distance measures from Tika in the future, e.g. as a result of TIKA-369, or even plug in a non-Tika identifier or a hybrid solution. > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Minor > Labels: UpdateProcessor > Fix For: 3.4 > > Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, > SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch > > > Language identification from document fields, and mapping of field names to > language-specific fields based on detected language. > Wrap the Tika LanguageIdentifier in an UpdateProcessor. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
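The normalization described in the comment above (mapping a raw language-identifier distance onto a 0-1 certainty so that a configurable langid.threshold can be applied) could be sketched roughly as follows. This is an illustrative sketch only, not the SOLR-1979 patch: the method names and the assumption that a lower distance means a better match are mine.

```java
// Hypothetical sketch (not the SOLR-1979 code): normalize a raw
// language-identifier distance into a 0..1 certainty so a configurable
// threshold (like langid.threshold) can be applied instead of a preset
// binary isReasonablyCertain()-style decision.
public class LangIdThresholdSketch {
    // Assumes lower distance = better match; result clamped into [0, 1].
    static double normalizeDistance(double distance, double maxDistance) {
        double certainty = 1.0 - (distance / maxDistance);
        return Math.max(0.0, Math.min(1.0, certainty));
    }

    static boolean acceptLanguage(double distance, double maxDistance, double threshold) {
        return normalizeDistance(distance, maxDistance) >= threshold;
    }

    public static void main(String[] args) {
        // A close match (small distance) passes a 0.5 threshold,
        // a distant match does not.
        System.out.println(acceptLanguage(0.1, 1.0, 0.5));
        System.out.println(acceptLanguage(0.9, 1.0, 0.5));
    }
}
```

Because the value is normalized, the same threshold keeps its meaning even if the underlying identifier (Tika or otherwise) is swapped out later.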
[jira] [Updated] (SOLR-2454) Would like link in site navigation to the ManifoldCF project
[ https://issues.apache.org/jira/browse/SOLR-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated SOLR-2454: -- Attachment: SOLR-2454.patch Patch for site reference to ManifoldCF > Would like link in site navigation to the ManifoldCF project > > > Key: SOLR-2454 > URL: https://issues.apache.org/jira/browse/SOLR-2454 > Project: Solr > Issue Type: Improvement > Components: documentation >Reporter: Karl Wright >Priority: Minor > Attachments: SOLR-2454.patch > > > The Solr/Lucene site points to lots of other Apache projects. It would be > nice if it also pointed to ManifoldCF. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076247#comment-13076247 ] Simon Rosenthal commented on SOLR-1032: --- revised patch looks good - do commit. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076237#comment-13076237 ] Mark Miller commented on SOLR-2688: --- +1 - not only is it better in almost every way IMO, but it lets you avoid the very nasty IndexReader leak in the current index based API. > switch solr 4.0 example to DirectSpellChecker > - > > Key: SOLR-2688 > URL: https://issues.apache.org/jira/browse/SOLR-2688 > Project: Solr > Issue Type: Improvement > Components: spellchecker >Affects Versions: 4.0 >Reporter: Robert Muir > > For discussion: we might want to switch the Solr 4.0 example to use > DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076232#comment-13076232 ] Jan Høydahl commented on SOLR-1032: --- Nice. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076211#comment-13076211 ] Erik Hatcher commented on SOLR-1032: Simon - thanks for the effort on this! I have taken a look and updated the patch with a test case and a change to use _literal.field_name=value_ convention. I think for the sake of this feature, it's best to stick with the established Solr Cell convention. Perhaps in another issue we can take up refactoring parameter naming for this capability. Thoughts? Objections? I'll commit this to trunk once I hear Simon's signoff. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Hatcher updated SOLR-1032: --- Attachment: SOLR-1032.patch Attached is a patch adding a test case and switching to use the Solr Cell established convention of _literal.field_name=value_ parameter naming. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076201#comment-13076201 ] Michael McCandless commented on SOLR-2688: -- +1 > switch solr 4.0 example to DirectSpellChecker > - > > Key: SOLR-2688 > URL: https://issues.apache.org/jira/browse/SOLR-2688 > Project: Solr > Issue Type: Improvement > Components: spellchecker >Affects Versions: 4.0 >Reporter: Robert Muir > > For discussion: we might want to switch the Solr 4.0 example to use > DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076199#comment-13076199 ] Michael McCandless commented on LUCENE-3030: bq. One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum. +1 -- I think CA should serve up a TermsEnum when provided a Terms? > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek.
If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
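The core idea in the issue description above (variable-sized blocks cut along shared prefixes, rather than a new block every fixed N terms) can be illustrated with a deliberately simplified toy: grouping a sorted term list by a fixed-length leading prefix. This is not the actual BlockTreeTermsWriter, which chooses prefix depths adaptively; all names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch (not the real BlockTreeTermsWriter): cut a sorted term list
// into variable-sized blocks keyed by a shared prefix. The index above
// the blocks then only needs to hold the prefixes, and a lookup whose
// prefix is absent can fail fast without scanning any block.
public class PrefixBlockSketch {
    static Map<String, List<String>> groupByPrefix(List<String> sortedTerms, int prefixLen) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        for (String term : sortedTerms) {
            String prefix = term.substring(0, Math.min(prefixLen, term.length()));
            blocks.computeIfAbsent(prefix, k -> new ArrayList<>()).add(term);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "apply", "banana", "band", "bandana", "cat");
        // Blocks end up with different sizes, driven by how terms share prefixes.
        System.out.println(groupByPrefix(terms, 3));
    }
}
```

The real implementation differs in that the cut points come from freezing tails while building the FST, so block boundaries follow the term distribution instead of a fixed prefix length.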
[jira] [Created] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
switch solr 4.0 example to DirectSpellChecker - Key: SOLR-2688 URL: https://issues.apache.org/jira/browse/SOLR-2688 Project: Solr Issue Type: Improvement Components: spellchecker Affects Versions: 4.0 Reporter: Robert Muir For discussion: we might want to switch the Solr 4.0 example to use DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076178#comment-13076178 ] Robert Muir commented on LUCENE-3030: - Also, we should measure if a "prefix automaton" with intersect() is faster than PrefixTermsEnum (I suspect it could be!) If this is true, we might want to not rewrite to prefixtermsenum anymore, instead changing PrefixQuery to extend AutomatonQuery too. > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek.
If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076177#comment-13076177 ] Robert Muir commented on LUCENE-3030: - This is awesome, I really like adding the intersect() hook! Thanks for making a branch, I will check it out and try to dive in to help with some of this :) One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum. This way, if it can be accomplished with a simpler enum like just terms.iterator() or prefixtermsenum etc., we get that optimization always. > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on its leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076171#comment-13076171 ] Robert Muir commented on LUCENE-3220: - Hi David, I was thinking for the norm, we could store it like DefaultSimilarity. This would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) by using 1/sqrt we will get better quantization from SmallFloat? {noformat}
public byte computeNorm(FieldInvertState state) {
  final int numTerms;
  if (discountOverlaps)
    numTerms = state.getLength() - state.getNumOverlap();
  else
    numTerms = state.getLength();
  return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
}
{noformat} For computations, you have to 'undo' the sqrt() to get the quantized length, but that's OK since it's only done up front a single time and tableized, so it won't slow anything down. > Implement various ranking models as Similarities > > > Key: LUCENE-3220 > URL: https://issues.apache.org/jira/browse/LUCENE-3220 > Project: Lucene - Java > Issue Type: Sub-task > Components: core/query/scoring, core/search >Affects Versions: flexscoring branch >Reporter: David Mark Nemeskey >Assignee: David Mark Nemeskey > Labels: gsoc, gsoc2011 > Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we > can finally work on implementing the standard ranking models. Currently DFR, > BM25 and LM are on the menu.
> Done: > * {{EasyStats}}: contains all statistics that might be relevant for a > ranking algorithm > * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the > DocScorers and as much implementation detail as possible > * _BM25_: the current "mock" implementation might be OK > * _LM_ > * _DFR_ > * The so-called _Information-Based Models_ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
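The "undo the sqrt" step mentioned in the comment above can be shown with a tiny round-trip sketch. This omits Lucene's actual SmallFloat byte quantization and just illustrates the arithmetic: the norm stores 1/sqrt(length), and a similarity that needs the raw length recovers it once up front as 1/(norm*norm).

```java
// Illustrative sketch of the norm round trip (omits Lucene's SmallFloat
// byte encoding): store the norm as 1/sqrt(length), then "undo" the
// sqrt once, up front, to get the (quantized) document length back.
public class NormLengthSketch {
    static float encodeNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static int decodeLength(float norm) {
        // length ~= 1 / norm^2; in practice this would be tableized
        // once per possible byte value, so it costs nothing per hit.
        return Math.round(1.0f / (norm * norm));
    }

    public static void main(String[] args) {
        for (int len : new int[] {1, 16, 256}) {
            System.out.println(len + " -> " + decodeLength(encodeNorm(len)));
        }
    }
}
```

With the real byte encoding the recovered length is only approximate, but the decode table is still computed a single time, so scoring speed is unaffected.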
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076156#comment-13076156 ] Robert Muir commented on LUCENE-3335: - I don't think there is any sense in this, who cares? We reported this crash to Oracle in plenty of time, and the *worse* wrong-results bug has been open since May 13: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738, but Oracle decided not to fix that, too. > jrebug causes porter stemmer to sigsegv > --- > > Key: LUCENE-3335 > URL: https://issues.apache.org/jira/browse/LUCENE-3335 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, > 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, > 3.3, 3.4, 4.0 > Environment: - JDK 7 Preview Release, GA (may also affect update _1, > targeted fix is JDK 1.7.0_2) > - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts >Reporter: Robert Muir >Assignee: Robert Muir > Labels: Java7 > Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, > patch-0uwe.patch > > > happens easily on java7: ant test -Dtestcase=TestPorterStemFilter > -Dtests.iter=100 > might happen on 1.6.0_u26 too, a user reported something that looks like the > same bug already: > http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076132#comment-13076132 ] Uwe Schindler commented on LUCENE-3335: --- @Shay: Sorry I did not want to be too italian :-) I just wanted to ensure that such configurations, leading to bugs in JVMs, would be reported to us. It would help us to also respond quicker on such bug reports, like the one we already got 2 months ago (which nobody was able to reproduce, as we did not know that the user used aggressive opts). > jrebug causes porter stemmer to sigsegv > --- > > Key: LUCENE-3335 > URL: https://issues.apache.org/jira/browse/LUCENE-3335 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, > 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, > 3.3, 3.4, 4.0 > Environment: - JDK 7 Preview Release, GA (may also affect update _1, > targeted fix is JDK 1.7.0_2) > - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts >Reporter: Robert Muir >Assignee: Robert Muir > Labels: Java7 > Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, > patch-0uwe.patch > > > happens easily on java7: ant test -Dtestcase=TestPorterStemFilter > -Dtests.iter=100 > might happen on 1.6.0_u26 too, a user reported something that looks like the > same bug already: > http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2687) Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news.
Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news. - Key: SOLR-2687 URL: https://issues.apache.org/jira/browse/SOLR-2687 Project: Solr Issue Type: Task Reporter: Julian Copes Find below the news item for the new Solr book; I can provide an image when prompted. I've included the URL for the new book, and the text is as follows: Rafał Kuć is proud to introduce a new book on Solr, "Apache Solr 3.1 Cookbook" from Packt Publishing. The Solr 3.1 Cookbook will make your everyday work easier by using real-life examples that show you how to deal with the most common problems that can arise while using the Apache Solr search engine. This cookbook will show you how to get the most out of your search engine. Each chapter covers a different aspect of working with Solr, from analyzing your text data through querying, performance improvement, and developing your own modules. The practical recipes will help you to quickly solve common problems with data analysis, show you how to use faceting to collect data, and speed up the performance of Solr. You will learn about functionalities that most newbies are unaware of, such as sorting results by a function value, highlighting matched words, and computing statistics to make your work with Solr easy and stress free. Click here to read more about the Apache Solr 3.1 Cookbook. (http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076114#comment-13076114 ] Dawid Weiss commented on LUCENE-3335: - Uwe has an Italian temper :) Btw. I really like the recent Yoda-discussion on concurrency-interest, Shay... > jrebug causes porter stemmer to sigsegv > --- > > Key: LUCENE-3335 > URL: https://issues.apache.org/jira/browse/LUCENE-3335 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, > 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, > 3.3, 3.4, 4.0 > Environment: - JDK 7 Preview Release, GA (may also affect update _1, > targeted fix is JDK 1.7.0_2) > - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts >Reporter: Robert Muir >Assignee: Robert Muir > Labels: Java7 > Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, > patch-0uwe.patch > > > happens easily on java7: ant test -Dtestcase=TestPorterStemFilter > -Dtests.iter=100 > might happen on 1.6.0_u26 too, a user reported something that looks like the > same bug already: > http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Labels: gsoc gsoc2011  (was: )

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>
> Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
> * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
> * integration tests, in which a small collection is indexed and then searched using the Similarities.
> Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Labels: gsoc gsoc2011 test  (was: gsoc gsoc2011)

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>
> Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
> * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
> * integration tests, in which a small collection is indexed and then searched using the Similarities.
> Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3220:
----------------------------------------

    Component/s: core/query/scoring
         Labels: gsoc gsoc2011  (was: gsoc)

> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring, core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc, gsoc2011
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.
> Done:
> * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm
> * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
> * _BM25_: the current "mock" implementation might be OK
> * _LM_
> * _DFR_
> * The so-called _Information-Based Models_
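As a point of reference for the models on the menu above, the per-term BM25 contribution can be hand-calculated from the textbook formula. The sketch below is a minimal standalone Python version, not Lucene's implementation; the k1/b defaults and the exact idf smoothing are assumptions for illustration:

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; num_docs: collection size; doc_len / avg_doc_len: inputs to
    the document-length normalization.
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

This is also the kind of hand calculation that score-validation unit tests for the new Similarities could be checked against.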
[jira] [Created] (LUCENE-3357) Unit and integration test cases for the new Similarities
Unit and integration test cases for the new Similarities
--------------------------------------------------------

                 Key: LUCENE-3357
                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
             Project: Lucene - Java
          Issue Type: Sub-task
          Components: core/query/scoring
    Affects Versions: flexscoring branch
            Reporter: David Mark Nemeskey
            Assignee: David Mark Nemeskey
            Priority: Minor
             Fix For: flexscoring branch


Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
* unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
* integration tests, in which a small collection is indexed and then searched using the Similarities.
Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3220:
----------------------------------------

    Attachment: LUCENE-3220.patch

Added norm decoding table to EasySimilarity, and removed sumTotalFreq.

Sorry I could only upload this patch now, but I didn't have time to work on Lucene in the last week. As I see it, all the problems you mentioned have been corrected, so maybe we can go on with the review?

> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.
> Done:
> * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm
> * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
> * _BM25_: the current "mock" implementation might be OK
> * _LM_
> * _DFR_
> * The so-called _Information-Based Models_
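For context on the "norm decoding table" mentioned above: Lucene stores length norms as a single byte, and a similarity typically decodes them through a precomputed 256-entry float table. The Python sketch below illustrates the idea using the 3-mantissa-bit / 5-exponent-bit encoding of Lucene's SmallFloat.byte315ToFloat; treat it as an illustration of the scheme, not the code in this patch:

```python
import struct

def byte315_to_float(b):
    """Decode a byte-encoded small float: 3 mantissa bits, 5 exponent
    bits, exponent bias 15 (the SmallFloat "315" scheme)."""
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << 21      # place mantissa/exponent into IEEE 754 bit positions
    bits += (63 - 15) << 24      # re-bias the 5-bit exponent for a 32-bit float
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# Precomputing the whole table once makes per-document decoding a single
# array lookup instead of a bit-twiddling call per hit.
NORM_TABLE = [byte315_to_float(i) for i in range(256)]
```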
[jira] [Commented] (LUCENE-3343) Comparison operators >,>=,<,<= and = support as RangeQuery syntax in QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076091#comment-13076091 ]

Olivier Favre commented on LUCENE-3343:
---------------------------------------

Great, thanks! No blockers for 3x?

> Comparison operators >,>=,<,<= and = support as RangeQuery syntax in QueryParser
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3343
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/queryparser
>            Reporter: Olivier Favre
>            Assignee: Adriano Crestani
>            Priority: Minor
>              Labels: parser, query
>             Fix For: 3.4, 4.0
>
>         Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> To offer better interoperability with other search engines and to provide an easier and more straight forward syntax, the operators >, >=, <, <= and = should be available to express an open range query.
> They should at least work for numeric queries.
> '=' can be made a synonym for ':'.
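To make the proposed mapping concrete, here is a small sketch of how each comparison operator could be rewritten into range-query syntax, where [..] endpoints are inclusive and {..} endpoints are exclusive. The function name, the use of '*' for the open endpoint, and the mixed-bracket forms are illustrative assumptions, not the attached patch:

```python
def comparison_to_range(field, op, value):
    """Rewrite 'field op value' into range-query syntax.

    Hypothetical helper: '[' / ']' mark inclusive endpoints, '{' / '}'
    exclusive ones, and '*' stands for an open endpoint.
    """
    if op == '>':
        return '%s:{%s TO *]' % (field, value)   # strictly greater: exclusive lower bound
    if op == '>=':
        return '%s:[%s TO *]' % (field, value)
    if op == '<':
        return '%s:[* TO %s}' % (field, value)   # strictly less: exclusive upper bound
    if op == '<=':
        return '%s:[* TO %s]' % (field, value)
    if op == '=':
        return '%s:%s' % (field, value)          # '=' as a synonym for ':'
    raise ValueError('unsupported operator: %r' % op)
```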
[jira] [Created] (LUCENE-3356) trunk TestRollingUpdates.testRollingUpdates seed failure
trunk TestRollingUpdates.testRollingUpdates seed failure
--------------------------------------------------------

                 Key: LUCENE-3356
                 URL: https://issues.apache.org/jira/browse/LUCENE-3356
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: selckin


trunk r1152892
reproducible: always

{code}
junit-sequential:
    [junit] Testsuite: org.apache.lucene.index.TestRollingUpdates
    [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.168 sec
    [junit]
    [junit] - Standard Error -
    [junit] NOTE: reproduce with: ant test -Dtestcase=TestRollingUpdates -Dtestmethod=testRollingUpdates -Dtests.seed=-5322802004404580273:-4001225075726350391
    [junit] WARNING: test method: 'testRollingUpdates' left thread running: merge thread: _c(4.0):cv3/2 _h(4.0):cv3 into _k
    [junit] RESOURCE LEAK: test method: 'testRollingUpdates' left 1 thread(s) running
    [junit] NOTE: test params are: codec=RandomCodecProvider: {docid=Standard, body=SimpleText, title=MockSep, titleTokenized=Pulsing(freqCutoff=20), date=MockFixedIntBlock(blockSize=1474)}, locale=lv_LV, timezone=Pacific/Fiji
    [junit] NOTE: all tests run in this JVM:
    [junit] [TestRollingUpdates]
    [junit] NOTE: Linux 2.6.39-gentoo amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=8,threads=1,free=128782656,total=158400512
    [junit] - ---
    [junit] Testcase: testRollingUpdates(org.apache.lucene.index.TestRollingUpdates): FAILED
    [junit] expected:<20> but was:<21>
    [junit] junit.framework.AssertionFailedError: expected:<20> but was:<21>
    [junit] at org.apache.lucene.index.TestRollingUpdates.testRollingUpdates(TestRollingUpdates.java:76)
    [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
    [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.index.TestRollingUpdates FAILED
{code}