[jira] [Created] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Key: LUCENE-3358
URL: https://issues.apache.org/jira/browse/LUCENE-3358
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.3
Reporter: Trejkaz

Lucene 3.3 (possibly 3.1 onwards) exhibits less-than-great behaviour when tokenising hiragana if combining marks are in use. Here's a unit test:

{code}
@Test
public void testHiraganaWithCombiningMarkDakuten() throws Exception {
    // Hiragana 'sa' followed by the combining mark dakuten
    TokenStream stream = new StandardTokenizer(Version.LUCENE_33,
                                               new StringReader("\u3055\u3099"));

    // Should be kept together.
    List<String> expectedTokens = Arrays.asList("\u3055\u3099");
    List<String> actualTokens = new LinkedList<String>();

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
        actualTokens.add(term.toString());
    }

    assertEquals("Wrong tokens", expectedTokens, actualTokens);
}
{code}

This code fails with:

{noformat}
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
{noformat}

It seems the tokeniser is throwing away the combining mark entirely. 3.0's behaviour was also undesirable:

{noformat}
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
{noformat}

But at least the token was there, so it was possible to write a filter to work around the issue. Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run appear to be lumped into a single token (a problem in its own right, though I'm not sure whether it's really a bug).

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
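Until the tokenizer itself handles combining marks, one possible workaround (a sketch only, not part of any committed fix) is to compose input to Unicode NFC before tokenization, so that さ (U+3055) plus the combining dakuten (U+3099) becomes the single precomposed character ざ (U+3056), leaving no bare combining mark for the tokenizer to drop:

```java
import java.text.Normalizer;

public class Main {
    public static void main(String[] args) {
        // Hiragana 'sa' (U+3055) followed by the combining dakuten (U+3099)
        String decomposed = "\u3055\u3099";

        // NFC composes the pair into the single code point U+3056 ('za'),
        // so a downstream tokenizer never sees a bare combining mark.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(composed.length());                        // 1
        System.out.println(Integer.toHexString(composed.charAt(0)));  // 3056
    }
}
```

In a Lucene analysis chain this normalization would more naturally live in a CharFilter or TokenFilter ahead of StandardTokenizer; the sketch above only demonstrates the Unicode composition step with the JDK's own java.text.Normalizer.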
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 129 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/129/

2 tests failed.

REGRESSION: org.apache.solr.client.solrj.embedded.MultiCoreExampleJettyTest.testMultiCore

Error Message:
Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. - java.lang.RuntimeException: java.io.FileNotFoundException: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk-java7/checkout/solr/example/multicore/core0/data/index/org.apache.solr.core.RefCntRamDirectory@7e96f890 lockFactory=org.apache.lucene.store.simplefslockfact...@4b905345-write.lock (No such file or directory)

Stack Trace:
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:392)
at org.apache.solr.core.SolrCore.(SolrCore.java:562)
at org.apache.solr.core.SolrCore.(SolrCore.java:509)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
at org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(Jet

Error Message:
request: http://localhost:27238/example/core0/update?commit=true&waitSearcher=true&wt=javabin&version=2

Stack Trace:
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:434)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:104)
at org.apache.solr.client.solrj.MultiCoreExampleTestBase.testMultiCore(MultiCoreExampleTestBas
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078578#comment-13078578 ]

David Smiley commented on SOLR-2690:
------

Hoss, thanks for elaborating on the distinction between the date literal and the DateMath timezone. I was conflating these issues in my mind -- silly me.

> Date Faceting or Range Faceting doesn't take timezone into account.
> ---
>
> Key: SOLR-2690
> URL: https://issues.apache.org/jira/browse/SOLR-2690
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 3.3
> Reporter: David Schlotfeldt
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> Timezone needs to be taken into account when doing date math. Currently it isn't. DateMathParser instances created are always being constructed with UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight-savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year.
> I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved.
> Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of the date facets.
> Another solution is to expand the syntax of the expressions DateMathParser understands. For example we could allow "(?timeZone=VALUE)" to be added anywhere within an expression. VALUE would be the id of the timezone. When DateMathParser reads this, it sets the timezone on the Calendar it is using.
> Two examples:
> - "(?timeZone=America/Chicago)NOW/YEAR"
> - "(?timeZone=America/Chicago)+1MONTH"
> I would be more than happy to modify DateMathParser and provide a patch. I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used.
> Thanks!
> David
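The reporter's point about +1MONTH can be made concrete with a standalone java.util sketch (this is plain JDK Calendar arithmetic for illustration, not DateMathParser itself): the span covered by a one-month gap starting 2011-03-01 is an hour shorter in America/Chicago than in UTC, because US daylight saving began on March 13, 2011.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class Main {
    // Length, in hours, of a "+1MONTH" gap starting at 2011-03-01T00:00
    // local time in the given zone.
    public static long monthGapHours(String tzId) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone(tzId));
        cal.clear();
        cal.set(2011, Calendar.MARCH, 1, 0, 0, 0);
        long start = cal.getTimeInMillis();
        cal.add(Calendar.MONTH, 1);       // -> 2011-04-01T00:00 local time
        long end = cal.getTimeInMillis();
        return (end - start) / 3600000L;  // millis per hour
    }

    public static void main(String[] args) {
        System.out.println(monthGapHours("UTC"));             // 744
        System.out.println(monthGapHours("America/Chicago")); // 743 (DST began Mar 13, 2011)
    }
}
```

So a facet gap computed with a UTC-configured parser and one computed in the user's zone genuinely cover different amounts of wall-clock time, which is exactly why the rounding timezone matters.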
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2691:
------
Attachment: SOLR-2691.patch

patch of persistence tests at the CoreContainer level (since that's where the bug was) that incorporates Yury's fix.

the assertions could definitely be beefed up to sanity-check more aspects of the serialization, and we should really also be testing that "load" works and parses all of these things back in, in the expected way, but it's a start.

The thing that's currently hanging me up is that somehow the test is leaking a SolrIndexSearcher reference. I thought maybe it was because of the SolrCores i was creating+registering and then ignoring, but if i try to close them i get an error about too many decrefs instead.

I'll let miller figure it out
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-2691:
------
Fix Version/s: 4.0
Assignee: Mark Miller
[jira] [Commented] (LUCENE-2979) Simplify configuration API of contrib Query Parser
[ https://issues.apache.org/jira/browse/LUCENE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078549#comment-13078549 ]

Adriano Crestani commented on LUCENE-2979:
------

Hi Phillipe,

Thanks for the patch. I just applied your patch for 3x. It looks good. Since you removed TestAttributes, can you create another JUnit test that verifies the configuration is updated when an attribute (like CharTermAttribute) is updated? That is basically the new functionality replacing the newly deprecated query parser attributes.

> Simplify configuration API of contrib Query Parser
> ---
>
> Key: LUCENE-2979
> URL: https://issues.apache.org/jira/browse/LUCENE-2979
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/other
> Affects Versions: 2.9, 3.0
> Reporter: Adriano Crestani
> Assignee: Adriano Crestani
> Labels: api-change, gsoc, gsoc2011, lucene-gsoc-11, mentor
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-2979_phillipe_ramalho_2.patch, LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_4_3x.patch, LUCENE-2979_phillipe_ramalho_4_trunk.patch, LUCENE-2979_phillipe_reamalho.patch
>
> The current configuration API is very complicated and inherits the concept used by the Attribute API to store token information in token streams. However, the requirements for the two (QP config and token streams) are not the same, so they shouldn't be using the same mechanism.
> I propose to simplify the QP config and make it less scary for people intending to use the contrib QP. The task is not difficult; it will just require a lot of code change and figuring out the best way to do it. That's why it's a good candidate for a GSoC project.
> I would like to hear good proposals about how to make the API more friendly and less scary :)
CHANGES.txt for modules
I can see that the descriptions of changes made to the modules are still in contrib/CHANGES.txt. Are they going to be moved to a modules/CHANGES.txt in the future?
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078538#comment-13078538 ]

David Schlotfeldt commented on SOLR-2690:
------

Being able to specify dates in timezones other than GMT+0 isn't a problem. It would just be nice, but we can ignore that. The time zone the DateMathParser is configured with is the issue (which it sounds like you understand).

My solution, which changes the timezone DateMathParser is constructed with in SimpleFacets to parse start, end and gap, isn't ideal. I went this route because I don't want to run a custom-built Solr -- my solution allowed me to fix the "bug" by simply replacing the "facet" SearchComponent. Affecting all DateMathParsers created for the length of the request is what is really needed (which is what you said).

I like your approach. It sounds like we are on the same page. So, can we get this added? :) Without the time zone affecting DateMathParser, date faceting is useless (at least for 100% of the situations I would use it for).

By the way, I'm glad to see how many responses there have been. I'm happy to see how active this project is. :)
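The rounding half of this discussion can also be illustrated with a plain java.util sketch (again just JDK Calendar code, under the assumption that a date-math expression like NOW/DAY means "floor to the start of the day in some zone"): flooring the same instant to day-start in UTC versus America/Chicago yields instants five hours apart during US daylight saving.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class Main {
    // Floor an instant to the start of its day in the given zone --
    // roughly what a NOW/DAY rounding would mean under that zone.
    public static long floorToDay(long instant, String tzId) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone(tzId));
        cal.setTimeInMillis(instant);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        // 2011-08-03T12:00:00Z
        Calendar utc = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        utc.clear();
        utc.set(2011, Calendar.AUGUST, 3, 12, 0, 0);
        long now = utc.getTimeInMillis();

        long utcDay = floorToDay(now, "UTC");
        long chiDay = floorToDay(now, "America/Chicago");
        // Chicago is UTC-5 (CDT) in August, so its Aug 3 starts 5 hours later
        System.out.println((chiDay - utcDay) / 3600000L); // 5
    }
}
```

This is why a facet range rounded to "midnight" in one zone and a filter query rounded in another would partition documents differently, even for the same nominal expression.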
Re: Did solr.xml persistence brake?
: I opened SOLR-2691 to track and attached a patch.
:
: Would appreciate a quick look from a committer. Thanks!

I'm not too familiar with that code, but i can definitely reproduce the bug ... i'll take a look at the existing tests and see if i can help out with your patch.

-Hoss
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078530#comment-13078530 ]

Hoss Man commented on SOLR-2689:
------

Hmmm, ok ... it looks like maybe this bug was spurred on by a recent mailing list thread about score filtering, where someone referred to this even older thread with a msg from Yonik... http://search-lucene.com/m/4AHNF17wIJW1/

...based on his wording ("frange could possible help ... perhaps something like..."), i don't think yonik really thought that answer through very hard, so it shouldn't be taken as gospel that he was advocating that solution would work (even though strictly speaking it does filter by score), let alone "will work and will still give you meaningful scores that you can sort on".

If you want to filter by arbitrary score (and i won't bother to list all the reasons i think that is a bad idea) and still get those scores back and be able to sort on them, then you still need the "q" to be a query that produces scores, and leave the filtering to an "fq"...

{code}
?q=ipod&fl=*,score&fq={!frange+l=0.72}query($q)
{code}

> !frange with query($qq) sets score=1.0f for all returned documents
> ---
>
> Key: SOLR-2689
> URL: https://issues.apache.org/jira/browse/SOLR-2689
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.4
> Reporter: Markus Jelsma
> Fix For: 3.4, 4.0
>
> Consider the following queries; both query the default field for 'test' and return the document digest and score (i don't seem to be able to get only score; fl=score returns all fields):
> This is a normal query and yields normal results with proper scores:
> {code}
> q=test&fl=digest,score
> {code}
> {code}
> 4.952673
> c48e784f06a051d89f20b72194b0dcf0
> 4.952673
> 7f78a504b8cbd86c6cdbf2aa2c4ae5e3
> 4.952673
> 0f7fefa6586ceda42fc1f095d460aa17
> {code}
> This query uses frange with query() to limit the number of returned documents.
When using multiple search terms i can indeed cut off the result
> set, but in the end all returned documents have score=1.0f. The final result
> set cannot be sorted by score anymore. The result set seems to be returned in
> the order of Lucene docIds.
> {code}
> q={!frange l=1.23}query($qq)&qq=test&fl=digest,score
> {code}
> {code}
> 1.0
> c48e784f06a051d89f20b72194b0dcf0
> 1.0
> 7f78a504b8cbd86c6cdbf2aa2c4ae5e3
> 1.0
> 0f7fefa6586ceda42fc1f095d460aa17
> {code}
Re: Did solr.xml persistence brake?
On 8/2/2011 5:42 PM, Yury Kats wrote:
> It used to work fine with a trunk build from a couple of months ago, so it looks like
> something broke solr.xml persistence.

Can it be related to SOLR-2331? Looking at the code, it does seem like a regression from SOLR-2331. CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list.

I opened SOLR-2691 to track and attached a patch. Would appreciate a quick look from a committer. Thanks!
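The bug pattern described above is easy to reproduce in isolation (an illustrative sketch, not the actual CoreContainer code): when one mutable map is allocated outside the loop and reused, every element of the resulting list aliases the same object, so only the last iteration's values survive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Main {
    // Buggy pattern: a single map allocated outside the loop is shared
    // by every list entry, so all entries reflect the last core only.
    public static List<Map<String, String>> buildShared(String[] cores) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        Map<String, String> attribs = new HashMap<String, String>(); // shared -- the bug
        for (String core : cores) {
            attribs.put("name", core);
            out.add(attribs);
        }
        return out;
    }

    // Fixed pattern: allocate a fresh map per core, inside the loop.
    public static List<Map<String, String>> buildPerCore(String[] cores) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        for (String core : cores) {
            Map<String, String> attribs = new HashMap<String, String>();
            attribs.put("name", core);
            out.add(attribs);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] cores = {"core0", "core1"};
        System.out.println(buildShared(cores).get(0).get("name"));  // core1 -- wrong
        System.out.println(buildPerCore(cores).get(0).get("name")); // core0 -- correct
    }
}
```

This mirrors why the persisted solr.xml ended up with the last core repeated once per core entry: the serializer saw the same attribute map N times.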
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yury Kats updated SOLR-2691:
------
Attachment: jira2691.patch

Patch. Create map of attributes inside the loop.
[jira] [Commented] (SOLR-2331) Refactor CoreContainer's SolrXML serialization code and improve testing
[ https://issues.apache.org/jira/browse/SOLR-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078526#comment-13078526 ]

Yury Kats commented on SOLR-2331:
------

Looks like this introduced a regression in solr.xml persistence. See SOLR-2691.

> Refactor CoreContainer's SolrXML serialization code and improve testing
> ---
>
> Key: SOLR-2331
> URL: https://issues.apache.org/jira/browse/SOLR-2331
> Project: Solr
> Issue Type: Improvement
> Components: multicore
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331.patch
>
> CoreContainer has enough code in it - I'd like to factor out the solr.xml serialization code into SolrXMLSerializer or something - which should make testing it much easier and lightweight.
[jira] [Created] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
solr.xml persistence is broken for multicore (by SOLR-2331)
---

Key: SOLR-2691
URL: https://issues.apache.org/jira/browse/SOLR-2691
Project: Solr
Issue Type: Bug
Components: multicore
Affects Versions: 4.0
Reporter: Yury Kats
Priority: Critical

With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence.

It appears to have been introduced by SOLR-2331: CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list.

I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun

I'm starting Solr with four cores listed in solr.xml:

I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST

And the solr.xml turns into:
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078522#comment-13078522 ]

Hoss Man commented on SOLR-2689:
------

I don't really understand why this is a bug? "frange" is the FunctionRangeQParserPlugin which produces ConstantScoreRangeQueries -- it doesn't matter when/how/why it's used (or that the function it's wrapping comes from an arbitrary query), it always produces range queries that generate constant scores.
[jira] [Updated] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-2690: --- Issue Type: Improvement (was: Bug) bq. If I want midnight in Central time zone I shouldn't have to write: 2011-01-01T06:00:00Z (Note I wrote 6:00 not 0:00) "Central time zone" is a vague concept that may mean one thing to you, but may mean something different to someone else. For any arbitrary moment in the (one dimensional) space of time values, there are an infinite number of ways to represent that time as a string (or as number) depending on where you place your origin for the coordinate system. Requiring that clients format times in UTC is no different then requiring clients to use Arabic numerals to represent integers -- it's just a matter of making sure there is no ambiguity, and everyone is using the same definition of "0". UTC is a completely unambiguous coordinate system for times, that is guaranteed to work in any JVM that Solr might run on. Even if we added code to allow dates to be expressed in arbitrary user selected timezones, we couldn't make that garuntee. Bottom line: the issue of parsing/formatting times in other coordinate systems (ie: timezones) should not be convoluted with the issue of what timezone is used by the DateMathParser when rounding -- those are distinct issues. It's completely conceivable to have a QParser that accepts a variety of data formats and "guesses" what TZ is meant and use that QParser in the same request where you want date faceting based on a TZ that is specified distinctly from the query string (ie: user's local TZ is UTC-0700, but they are searching for records dated before "Dec 15, 2010 4:20PM EST") bq. So one possible alternative that needs more thought is a "TZ" request parameter that would apply by default to things that are date related. Right ... 
From the beginning, DateMathParser was designed with the hope that a TZ/Locale pair could be specified per request (or per field declaration) for driving the rounding/math logic; there was just no sane way to specify an alternative to UTC/US that could be passed down into the DateMathParser and used ubiquitously in a request, because of the FieldType API. (Slight digression... bq. its really only essential that we can affect DateMathParser the SimpleFacets uses when dealing with the gap of the date facets. ...just changing the TZ used by that instance of DateMathParser for rounding/math isn't going to do any good if the user then tries to filter on one of those constraints and the filter query code winds up using the defaults in DateField (ie: NOW/DAY and NOW/DAY+1HOUR are going to be very different things in the facet count code path vs the filter query code path)) Now that we have SolrRequestInfo and a request param to specify the meaning of "NOW", the same logic could be used to allow a request param to specify the TZ/Locale properties of the DateMathParser as well. But like I said: this should really only be used to affect the *math* in DateMathParser -- it should not be used in DateField.parseDate/formatDate, because DateField by definition deals with a single canonical time format; by the time the DateField class is involved in dealing with a Date, everything should be unambiguously expressible in UTC. Logic for parsing date strings that aren't in the canonical date format should be a QParser responsibility at query time, or an UpdateProcessor responsibility at index time. Logic for formatting dates in non-canonical format should be a ResponseWriter responsibility. This new request property we're talking about for defining the "user's TZ" can certainly be used in all of these places to pick/override defaults, but that type of logic really doesn't belong in DateField. > Date Faceting or Range Faceting doesn't take timezone into account. 
> --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Improvement >Affects Versions: 3.3 >Reporter: David Schlotfeldt > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone, daylight saving time changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > where DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could
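Hoss's "coordinate system" point can be demonstrated with a short JDK-only sketch (an illustration, not Solr code): the wall-clock string "midnight, Jan 1 2011" interpreted in US Central time and the canonical UTC string 2011-01-01T06:00:00Z name the same instant.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class UtcCoordinateDemo {
    public static void main(String[] args) throws Exception {
        // Parse the wall-clock string "midnight, Jan 1 2011" as US Central time.
        SimpleDateFormat central = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
        central.setTimeZone(TimeZone.getTimeZone("America/Chicago"));
        Date midnightCentral = central.parse("2011-01-01T00:00:00");

        // Format the same instant in UTC, Solr's canonical representation.
        // Chicago is UTC-6 in January (CST), hence the 6-hour shift.
        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(utc.format(midnightCentral)); // prints 2011-01-01T06:00:00Z
    }
}
```

Both strings are just different renderings of one point on the time line, which is why requiring clients to submit the UTC form loses no information.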
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078467#comment-13078467 ] Markus Jelsma commented on SOLR-2689: - You are right; it's because both examples use one search term and thus all documents have the same score. The problem shows up when you use multiple terms, so that not all scores are identical. I'll provide a better description and example next week when I get back. > !frange with query($qq) sets score=1.0f for all returned documents > -- > > Key: SOLR-2689 > URL: https://issues.apache.org/jira/browse/SOLR-2689 > Project: Solr > Issue Type: Bug > Components: search >Affects Versions: 3.4 >Reporter: Markus Jelsma > Fix For: 3.4, 4.0 > > > Consider the following queries; both query the default field for 'test' and > return the document digest and score (I don't seem to be able to get only the score; > fl=score returns all fields): > This is a normal query and yields normal results with proper scores: > {code} > q=test&fl=digest,score > {code} > {code} > > 4.952673 > c48e784f06a051d89f20b72194b0dcf0 > > 4.952673 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 4.952673 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} > This query uses frange with query() to limit the number of returned > documents. When using multiple search terms I can indeed cut off the result > set, but in the end all returned documents have score=1.0f. The final result > set cannot be sorted by score anymore. The result set seems to be returned in > the order of Lucene docIds. > {code} > q={!frange l=1.23}query($qq)&qq=test&fl=digest,score > {code} > {code} > > 1.0 > c48e784f06a051d89f20b72194b0dcf0 > > 1.0 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 1.0 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} -- This message is automatically generated by JIRA. 
[jira] [Assigned] (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-1692: - Assignee: (was: Grant Ingersoll) > CarrotClusteringEngine produce summary does nothing > --- > > Key: SOLR-1692 > URL: https://issues.apache.org/jira/browse/SOLR-1692 > Project: Solr > Issue Type: Bug > Components: contrib - Clustering >Reporter: Grant Ingersoll > Fix For: 3.4, 4.0 > > Attachments: SOLR-1692.patch > > > In the CarrotClusteringEngine, the produceSummary option does nothing, as the > results of doing the highlighting are just ignored. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Did solr.xml persistence break?
Hi, With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. Can it be related to SOLR-2331? I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun I'm starting Solr with four cores listed in solr.xml: I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST And the solr.xml turns into: - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078451#comment-13078451 ] Otis Gospodnetic commented on SOLR-2689: Markus - I can't even tell that this frange call cuts off any of the hits - you have numFound="227763" in both examples above. Am I missing something? :) > !frange with query($qq) sets score=1.0f for all returned documents > -- > > Key: SOLR-2689 > URL: https://issues.apache.org/jira/browse/SOLR-2689 > Project: Solr > Issue Type: Bug > Components: search >Affects Versions: 3.4 >Reporter: Markus Jelsma > Fix For: 3.4, 4.0 > > > Consider the following queries; both query the default field for 'test' and > return the document digest and score (I don't seem to be able to get only the score; > fl=score returns all fields): > This is a normal query and yields normal results with proper scores: > {code} > q=test&fl=digest,score > {code} > {code} > > 4.952673 > c48e784f06a051d89f20b72194b0dcf0 > > 4.952673 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 4.952673 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} > This query uses frange with query() to limit the number of returned > documents. When using multiple search terms I can indeed cut off the result > set, but in the end all returned documents have score=1.0f. The final result > set cannot be sorted by score anymore. The result set seems to be returned in > the order of Lucene docIds. > {code} > q={!frange l=1.23}query($qq)&qq=test&fl=digest,score > {code} > {code} > > 1.0 > c48e784f06a051d89f20b72194b0dcf0 > > 1.0 > 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 > > 1.0 > 0f7fefa6586ceda42fc1f095d460aa17 > > {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078434#comment-13078434 ] Grant Ingersoll commented on LUCENE-2748: - I wonder if the best thing to do here is to simply start fresh and clean, leave all existing content up as-is, and link to it as the "old" content. > Convert all Lucene web properties to use the ASF CMS > > > Key: LUCENE-2748 > URL: https://issues.apache.org/jira/browse/LUCENE-2748 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > > The new CMS has a lot of nice features (and some kinks to still work out) and > Forrest just doesn't cut it anymore, so we should move to the ASF CMS: > http://apache.org/dev/cms.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-2143) Add OpenSearch resources to the bundled example
[ https://issues.apache.org/jira/browse/SOLR-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-2143: - Assignee: (was: Grant Ingersoll) > Add OpenSearch resources to the bundled example > > > Key: SOLR-2143 > URL: https://issues.apache.org/jira/browse/SOLR-2143 > Project: Solr > Issue Type: Wish > Components: documentation >Affects Versions: 4.0 > Environment: N/A >Reporter: Rich Cariens > Fix For: 4.0 > > Attachments: SOLR-2143.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > Guidance & samples on how to make a Solr instance OpenSearch-compliant feels > like it would add value to the user community. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078419#comment-13078419 ] David Schlotfeldt commented on SOLR-2690: - Okay, I've modified my code to now take "facet.date.tz" instead. The time zone now affects the facet's start, end and gap values. > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David Schlotfeldt > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone, daylight saving time changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > where DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone), it's really only > essential that we can affect the DateMathParser that SimpleFacets uses when > dealing with the gap of the date facets. > Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in, it sets the timezone on the Calendar it is using. 
> Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more than happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved, and a decision needs > to be made on the syntax used. > Thanks! > David -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
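The issue's central claim - that a +1MONTH gap has a different length depending on the timezone - can be checked with a small JDK-only sketch (this is plain Calendar arithmetic, not the actual DateMathParser code): month arithmetic in America/Chicago crosses the March 2011 DST transition, so that "month" is one hour shorter than the same month computed in UTC.

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateGapDemo {
    // Length in milliseconds of "March 1, local midnight, plus 1 month"
    // in the given timezone -- the kind of math a +1MONTH facet gap does.
    static long monthMillis(String tzId) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(tzId), Locale.US);
        cal.clear();
        cal.set(2011, Calendar.MARCH, 1, 0, 0, 0); // local midnight, March 1 2011
        long start = cal.getTimeInMillis();
        cal.add(Calendar.MONTH, 1);                // wall-clock "+1MONTH"
        return cal.getTimeInMillis() - start;
    }

    public static void main(String[] args) {
        long utcHours = monthMillis("UTC") / 3600000L;
        // US DST began March 13, 2011, so March in Chicago is an hour shorter.
        long chicagoHours = monthMillis("America/Chicago") / 3600000L;
        System.out.println(utcHours + " vs " + chicagoHours); // prints 744 vs 743
    }
}
```

A facet gap computed in UTC therefore does not line up with local-calendar month boundaries, which is exactly why a per-request timezone for the *math* matters.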
[Lucene.Net] [jira] [Resolved] (LUCENENET-404) Improve brand logo design
[ https://issues.apache.org/jira/browse/LUCENENET-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Troy Howard resolved LUCENENET-404. --- Resolution: Fixed Uploaded the artifacts in r1153264 > Improve brand logo design > - > > Key: LUCENENET-404 > URL: https://issues.apache.org/jira/browse/LUCENENET-404 > Project: Lucene.Net > Issue Type: Sub-task > Components: Project Infrastructure >Reporter: Troy Howard >Assignee: Troy Howard >Priority: Minor > Labels: branding, logo > Attachments: lucene-alternates.jpg, lucene-medium.png, > lucene-net-logo-display.jpg > > > The existing Lucene.Net logo leaves a lot to be desired. We'd like a new logo > that is modern and well designed. > To implement this, Troy is coordinating with StackOverflow/StackExchange to > manage a logo design contest, the results of which will be our new logo > design. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch EasySimilarity now computes norms in the same way as DefaultSimilarity. Actually not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right), what do you reckon? I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices. > Implement various ranking models as Similarities > > > Key: LUCENE-3220 > URL: https://issues.apache.org/jira/browse/LUCENE-3220 > Project: Lucene - Java > Issue Type: Sub-task > Components: core/query/scoring, core/search >Affects Versions: flexscoring branch >Reporter: David Mark Nemeskey >Assignee: David Mark Nemeskey > Labels: gsoc, gsoc2011 > Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we > can finally work on implementing the standard ranking models. Currently DFR, > BM25 and LM are on the menu. > Done: > * {{EasyStats}}: contains all statistics that might be relevant for a > ranking algorithm > * {{EasySimilarity}}: the ancestor of all the other similarities. 
Hides the > DocScorers and as much implementation detail as possible > * _BM25_: the current "mock" implementation might be OK > * _LM_ > * _DFR_ > * The so-called _Information-Based Models_ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078400#comment-13078400 ] Simon Willnauer commented on LUCENE-3030: - bq. These are huge speedups for the terms-dict intensive queries! oh boy! This is awesome! > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). 
> In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on its leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
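The variable-size, prefix-based blocking idea can be sketched in toy form (an illustration of the concept only, not Lucene's BlockTreeTermsWriter): recursively extend the shared prefix one character at a time until each block of sorted terms fits under a maximum size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBlocksDemo {
    // Split a sorted, duplicate-free term range into blocks of at most
    // maxBlock terms by lengthening the shared prefix -- a toy version of
    // variable-size blocking, ignoring terms-on-inner-blocks and min sizes.
    static void blocks(List<String> sorted, int prefixLen, int maxBlock,
                       List<List<String>> out) {
        if (sorted.size() <= maxBlock) {
            out.add(sorted);
            return;
        }
        // Group by the next-longer prefix; TreeMap keeps term order.
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String t : sorted) {
            String p = t.substring(0, Math.min(prefixLen + 1, t.length()));
            byPrefix.computeIfAbsent(p, k -> new ArrayList<>()).add(t);
        }
        for (List<String> group : byPrefix.values()) {
            blocks(group, prefixLen + 1, maxBlock, out);
        }
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("car", "card", "care", "cat", "dog", "dot");
        List<List<String>> out = new ArrayList<>();
        blocks(terms, 0, 3, out);
        System.out.println(out); // prints [[car, card, care], [cat], [dog, dot]]
    }
}
```

The resulting block boundaries follow prefix structure rather than a fixed every-N-terms rule, which is what lets the index act as a true prefix trie and fast-fail lookups.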
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078384#comment-13078384 ] James Dyer commented on SOLR-2382: -- Lance, I do not have any scientific benchmarks, but I can tell you how we use BerkleyBackedCache and how it performs for us. In our main app, we fully re-index all our data every night (13+ million records). It's basically a 2-step process. First we run ~50 DIH handlers, each of which builds a cache from databases, flat files, etc. The caches partition the data 8 ways. Then a "master" DIH script does all the joining, runs transformers on the data, etc. We have all 8 invocations of this same "master" DIH config running simultaneously, indexing to the same Solr core, so each DIH invocation is processing 1.6 million records directly out of caches, doing all the 1-many joins, running transformer code, indexing, etc. This takes 1-1/2 hours, so maybe 250-300 Solr records get added per second. We're using fast local disks configured with RAID-0 on an 8-core 64GB server. This app is running Solr 1.4, using the original version of this patch, prior to my front-porting it to trunk. No doubt some of the time is spent contending for the Lucene index as all 8 DIH invocations are indexing at the same time. We also have another app that uses Solr 4.0 with the patch I originally posted back in February, sharing hardware with the main app. This one has about 10 entities and uses a simple 1-dih-handler configuration. The parent entity drives directly off the database while all the child entities use SqlEntityProcessor with BerkleyBackedCache. There are only 25,000 fairly narrow records and we can re-index everything in about 10 minutes. This includes database time, indexing, running transformers, etc., in addition to the cache overhead. The inspiration for this was that we were converting off of Endeca and we were relying on Endeca's "Forge" program to join & denormalize all of the data. 
Forge has a very fast disk-backed caching mechanism and I needed to match that performance with DIH. I'm pretty sure what we have here surpasses Forge. And we also get a big bonus in that it lets you persist caches and use them as a subsequent input. With Forge, we had to output the data into huge delimited text files and then use that as input for the next step... Hope this information gives you some idea if this will work for your use case. > DIH Cache Improvements > -- > > Key: SOLR-2382 > URL: https://issues.apache.org/jira/browse/SOLR-2382 > Project: Solr > Issue Type: New Feature > Components: contrib - DataImportHandler >Reporter: James Dyer >Priority: Minor > Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, > SOLR-2382-entities.patch, SOLR-2382-entities.patch, > SOLR-2382-properties.patch, SOLR-2382-properties.patch, > SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, > SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch > > > Functionality: > 1. Provide a pluggable caching framework for DIH so that users can choose a > cache implementation that best suits their data and application. > > 2. Provide a means to temporarily cache a child Entity's data without > needing to create a special cached implementation of the Entity Processor > (such as CachedSqlEntityProcessor). > > 3. Provide a means to write the final (root entity) DIH output to a cache > rather than to Solr. Then provide a way for a subsequent DIH call to use the > cache as an Entity input. Also provide the ability to do delta updates on > such persistent caches. > > 4. Provide the ability to partition data across multiple caches that can > then be fed back into DIH and indexed either to varying Solr Shards, or to > the same Core in parallel. > Use Cases: > 1. 
We needed a flexible & scalable way to temporarily cache child-entity > data prior to joining to parent entities. > - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" > problem. > - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching > mechanism and does not scale. > - There is no way to cache non-SQL inputs (ex: flat files, xml, etc). > > 2. We needed the ability to gather data from long-running entities by a > process that runs separate from our main indexing process. > > 3. We wanted the ability to do a delta import of only the entities that > changed. > - Lucene/Solr requires entire documents to be re-indexed, even if only a > few fields changed. > - Our data comes from 50+ complex sql queries and/or flat files. > - We do not want to incur overhea
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 125 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/125/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:198) Build Log (for compile errors): [...truncated 11154 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3343) Comparison operators >,>=,<,<= and = support as RangeQuery syntax in QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078372#comment-13078372 ] Adriano Crestani commented on LUCENE-3343: -- Hi Oliver, I was only able to make your patch work when I merged it with LUCENE-3338; however, LUCENE-3338 is only available for trunk, not 3x. I will need to figure out some other way to make it work on 3x. I plan to work on this soon, probably next weekend. > Comparison operators >,>=,<,<= and = support as RangeQuery syntax in > QueryParser > > > Key: LUCENE-3343 > URL: https://issues.apache.org/jira/browse/LUCENE-3343 > Project: Lucene - Java > Issue Type: New Feature > Components: modules/queryparser >Reporter: Olivier Favre >Assignee: Adriano Crestani >Priority: Minor > Labels: parser, query > Fix For: 3.4, 4.0 > > Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch > > Original Estimate: 96h > Remaining Estimate: 96h > > To offer better interoperability with other search engines and to provide an > easier and more straightforward syntax, > the operators >, >=, <, <= and = should be available to express an open range > query. > They should at least work for numeric queries. > '=' can be made a synonym for ':'. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
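The proposed mapping from comparison operators to range-query syntax could look roughly like this (a hypothetical sketch, not the attached patch; it assumes the classic query syntax where [] endpoints are inclusive and {} are exclusive, with mixed brackets for half-open ranges):

```java
public class ComparisonRewriteDemo {
    // Hypothetical rewrite of "field OP value" into classic range syntax.
    static String rewrite(String field, String op, String value) {
        switch (op) {
            case ">":  return field + ":{" + value + " TO *]";  // exclusive lower bound
            case ">=": return field + ":[" + value + " TO *]";  // inclusive lower bound
            case "<":  return field + ":[* TO " + value + "}";  // exclusive upper bound
            case "<=": return field + ":[* TO " + value + "]";  // inclusive upper bound
            case "=":  return field + ":" + value;              // '=' as a synonym for ':'
            default:   throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("price", ">=", "10")); // prints price:[10 TO *]
    }
}
```

Whatever syntax the parser finally adopts, the rewrite target would be an ordinary (numeric) range query, so scoring and rewriting behave exactly as for hand-written ranges.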
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078354#comment-13078354 ] Lance Norskog commented on SOLR-2382: - Hello- Are there any benchmark results with this patch? Given, say two tables with a million elements in a pairwise join, how well does this caching system work? > DIH Cache Improvements > -- > > Key: SOLR-2382 > URL: https://issues.apache.org/jira/browse/SOLR-2382 > Project: Solr > Issue Type: New Feature > Components: contrib - DataImportHandler >Reporter: James Dyer >Priority: Minor > Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, > SOLR-2382-entities.patch, SOLR-2382-entities.patch, > SOLR-2382-properties.patch, SOLR-2382-properties.patch, > SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, > SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, > SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch > > > Functionality: > 1. Provide a pluggable caching framework for DIH so that users can choose a > cache implementation that best suits their data and application. > > 2. Provide a means to temporarily cache a child Entity's data without > needing to create a special cached implementation of the Entity Processor > (such as CachedSqlEntityProcessor). > > 3. Provide a means to write the final (root entity) DIH output to a cache > rather than to Solr. Then provide a way for a subsequent DIH call to use the > cache as an Entity input. Also provide the ability to do delta updates on > such persistent caches. > > 4. Provide the ability to partition data across multiple caches that can > then be fed back into DIH and indexed either to varying Solr Shards, or to > the same Core in parallel. > Use Cases: > 1. We needed a flexible & scalable way to temporarily cache child-entity > data prior to joining to parent entities. 
> - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" > problem. > - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching > mechanism and does not scale. > - There is no way to cache non-SQL inputs (ex: flat files, xml, etc). > > 2. We needed the ability to gather data from long-running entities by a > process that runs separate from our main indexing process. > > 3. We wanted the ability to do a delta import of only the entities that > changed. > - Lucene/Solr requires entire documents to be re-indexed, even if only a > few fields changed. > - Our data comes from 50+ complex sql queries and/or flat files. > - We do not want to incur overhead re-gathering all of this data if only 1 > entity's data changed. > - Persistent DIH caches solve this problem. > > 4. We want the ability to index several documents in parallel (using 1.4.1, > which did not have the "threads" parameter). > > 5. In the future, we may need to use Shards, creating a need to easily > partition our source data into Shards. > Implementation Details: > 1. De-couple EntityProcessorBase from caching. > - Created a new interface, DIHCache & two implementations: > - SortedMapBackedCache - An in-memory cache, used as default with > CachedSqlEntityProcessor (now deprecated). > - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested > with je-4.1.6.jar >- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar. > I believe this may be incompatible due to Generic Usage. >- NOTE: I did not modify the ant script to automatically get this jar, > so to use or evaluate this patch, download bdb-je from > http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html > > 2. Allow Entity Processors to take a "cacheImpl" parameter to cause the > entity data to be cached (see EntityProcessorBase & DIHCacheProperties). > > 3. 
Partially De-couple SolrWriter from DocBuilder > - Created a new interface DIHWriter, & two implementations: >- SolrWriter (refactored) >- DIHCacheWriter (allows DIH to write ultimately to a Cache). > > 4. Create a new Entity Processor, DIHCacheProcessor, which reads a > persistent Cache as DIH Entity Input. > > 5. Support a "partition" parameter with both DIHCacheWriter and > DIHCacheProcessor to allow for easy partitioning of source entity data. > > 6. Change the semantics of entity.destroy() > - Previously, it was being called on each iteration of > DocBuilder.buildDocument(). > - Now it is does one-time cleanup tasks (like closing or deleting a > disk-backed cache) once the entity processor is completed. > - The only out-of-the-box entity processor that previously implemented > destroy() was LineEntitiyProcessor, so this is no
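The "partition" parameter described above amounts to routing each record key to one of N caches. A minimal sketch of such routing (a hypothetical helper for illustration, not code from the patch):

```java
import java.util.ArrayList;
import java.util.List;

public class CachePartitionDemo {
    // Route a record key to one of numPartitions caches.
    // Mask the sign bit rather than Math.abs (which overflows on MIN_VALUE).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 8; // e.g. the 8-way partitioning described in the comments above
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new ArrayList<>());

        for (String id : new String[] {"rec1", "rec2", "rec3", "rec4"}) {
            partitions.get(partitionFor(id, n)).add(id);
        }

        int total = 0;
        for (List<String> p : partitions) total += p.size();
        System.out.println(total); // prints 4 -- every record landed in exactly one partition
    }
}
```

Because the routing is deterministic, a later DIH invocation (or a separate Solr shard) can read back exactly the partition it owns.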
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078353#comment-13078353 ] Robert Muir commented on SOLR-2688: --- I'll work up a patch, might tweak the example a bit for the time being, I'd like to err on the side of performance. Note: with LUCENE-3030, Mike has really sped this guy up again. > switch solr 4.0 example to DirectSpellChecker > - > > Key: SOLR-2688 > URL: https://issues.apache.org/jira/browse/SOLR-2688 > Project: Solr > Issue Type: Improvement > Components: spellchecker >Affects Versions: 4.0 >Reporter: Robert Muir > > For discussion: we might want to switch the Solr 4.0 example to use > DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078352#comment-13078352 ] David Schlotfeldt commented on SOLR-2690: - By extending FacetComponent (and having to resort to reflection) I added: facet.date.gap.tz The new parameter only affects the gap. The math done when processing the gap is the largest issue when it comes to date faceting, in my mind. I would be more than happy to provide a patch to add this feature. No, this doesn't address all timezone issues, but at least it would address the main issue that makes date faceting, in my eyes, completely useless. I bet there are 100s of people out there using date faceting who don't realize it does NOT give correct results :) > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David Schlotfeldt > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone day-light-savings changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone) its really only > essential that we can affect DateMathParser the SimpleFacets uses when > dealing with the gap of the date facets. 
> Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in sets the timezone on the Calendar it is using. > Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more then happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved and a decision needs > to be made on the syntax used > Thanks! > David
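To make the daylight-savings effect described in the issue concrete: here is a small standalone Java sketch (my own illustration, not code from Solr or any patch) showing that adding one month to the same instant lands on different instants depending on the Calendar's time zone, because US DST begins mid-March:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class MonthGapDemo {
    public static void main(String[] args) {
        SimpleDateFormat utcFmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        utcFmt.setTimeZone(TimeZone.getTimeZone("UTC"));

        // Midnight March 1 2011 in America/Chicago == 2011-03-01T06:00:00Z.
        Calendar central = Calendar.getInstance(TimeZone.getTimeZone("America/Chicago"));
        central.clear();
        central.set(2011, Calendar.MARCH, 1, 0, 0, 0);

        // The same instant, interpreted with a UTC calendar
        // (what always constructing DateMathParser with UTC amounts to).
        Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        utc.clear();
        utc.set(2011, Calendar.MARCH, 1, 6, 0, 0);

        central.add(Calendar.MONTH, 1); // DST began Mar 13, so March is an hour short
        utc.add(Calendar.MONTH, 1);

        System.out.println(utcFmt.format(central.getTime())); // 2011-04-01T05:00:00Z
        System.out.println(utcFmt.format(utc.getTime()));     // 2011-04-01T06:00:00Z
    }
}
```

The two "+1 month" results differ by an hour, which is exactly why facet range edges computed in UTC can put documents into the wrong bucket for a user in another zone.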
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078337#comment-13078337 ] Michael McCandless commented on LUCENE-3030: Here's the graph of the results: !BlockTree.png! > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). 
> In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks.
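As a toy illustration of the variable-sized, prefix-sharing block idea discussed above (my own sketch, not code from the LUCENE-3030 patch): terms that share a prefix are split out into their own block only once enough of them accumulate, while rare prefixes stay in the parent block, which is the burst-trie-like policy in miniature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixBlocksDemo {
    // Split a sorted term list into variable-sized blocks keyed by a shared
    // prefix. A prefix gets its own block only once enough terms share it
    // (minBlockSize); rarer terms stay in the parent block, keyed by "".
    static Map<String, List<String>> blocks(List<String> sortedTerms, int prefixLen, int minBlockSize) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String t : sortedTerms) {
            String p = t.length() >= prefixLen ? t.substring(0, prefixLen) : t;
            byPrefix.computeIfAbsent(p, k -> new ArrayList<>()).add(t);
        }
        Map<String, List<String>> result = new TreeMap<>();
        List<String> parent = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byPrefix.entrySet()) {
            if (e.getValue().size() >= minBlockSize) {
                result.put(e.getKey(), e.getValue()); // popular prefix: own block
            } else {
                parent.addAll(e.getValue());          // rare prefix: parent block
            }
        }
        result.put("", parent);
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aa", "ab", "ac", "ba", "zz");
        // "a" is shared by 3 terms -> its own block; "b" and "z" stay in the root.
        System.out.println(blocks(terms, 1, 3)); // {=[ba, zz], a=[aa, ab, ac]}
    }
}
```

A real implementation works over a trie and recurses (sub-blocks of sub-blocks), but the core intuition is the same: block boundaries follow the data's prefix distribution instead of a fixed every-N-terms rule.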
[jira] [Updated] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3030: --- Attachment: BlockTree.png The block tree terms dict seems to be working... all tests pass w/ StandardTree codec. There's still more to do (many nocommits), but I think the perf results should be close to what I finally commit:
||Task||QPS base||StdDev base||QPS blocktree||StdDev blocktree||Pct diff||
|IntNRQ|11.58|1.37|10.11|1.77|{color:red}35%{color}-{color:green}16%{color}|
|Term|106.65|3.24|98.84|4.97|{color:red}14%{color}-{color:green}0%{color}|
|Prefix3|30.83|1.36|28.64|2.42|{color:red}18%{color}-{color:green}5%{color}|
|OrHighHigh|5.85|0.15|5.44|0.28|{color:red}14%{color}-{color:green}0%{color}|
|OrHighMed|19.25|0.62|17.91|0.86|{color:red}14%{color}-{color:green}0%{color}|
|Phrase|9.37|0.42|8.87|0.10|{color:red}10%{color}-{color:green}0%{color}|
|TermBGroup1M|44.02|0.90|42.76|1.08|{color:red}7%{color}-{color:green}1%{color}|
|TermGroup1M|37.68|0.65|36.95|0.74|{color:red}5%{color}-{color:green}1%{color}|
|TermBGroup1M1P|47.16|2.94|46.36|0.16|{color:red}7%{color}-{color:green}5%{color}|
|SpanNear|5.60|0.35|5.55|0.29|{color:red}11%{color}-{color:green}11%{color}|
|SloppyPhrase|3.36|0.16|3.34|0.04|{color:red}6%{color}-{color:green}5%{color}|
|Wildcard|35.15|1.30|35.05|2.42|{color:red}10%{color}-{color:green}10%{color}|
|AndHighHigh|10.71|0.22|10.99|0.22|{color:red}1%{color}-{color:green}6%{color}|
|AndHighMed|51.15|1.44|54.31|1.84|{color:green}0%{color}-{color:green}12%{color}|
|Fuzzy1|31.63|0.55|66.15|1.35|{color:green}101%{color}-{color:green}117%{color}|
|PKLookup|40.00|0.75|84.93|5.49|{color:green}94%{color}-{color:green}130%{color}|
|Fuzzy2|33.78|0.82|89.59|2.46|{color:green}151%{color}-{color:green}179%{color}|
|Respell|23.56|1.15|70.89|1.77|{color:green}179%{color}-{color:green}224%{color}|
This is for a multi-segment index, 10 M wikipedia docs, using luceneutil. 
These are huge speedups for the terms-dict intensive queries! The two FuzzyQuerys and Respell get the speedup from the directly implemented intersect method, and PKLookup gets gains because it can often avoid seeking, since block tree's terms index can sometimes rule out terms by their prefix (though this relies on the PK terms being "predictable" -- I use "%09d" w/ a counter, now; if you instead used something more random looking (GUIDs) I don't think we'd see gains). > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. 
If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] >
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 124 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/124/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)
at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:639)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:99)
at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:174)
Build Log (for compile errors): [...truncated 11177 lines...]
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078292#comment-13078292 ] David commented on SOLR-2690: - Good point. Also, this isn't a bug, but if we want a complete solution we really need a way to specify times in other timezones. If I want midnight in the Central time zone I shouldn't have to write: 2011-01-01T06:00:00Z (Note I wrote 6:00, not 0:00.) I believe only DateField would have to be modified to make it possible to specify a timezone. For a complete example, if I wanted to facet blog posts by the date posted and by month: facet.date=blogPostDate facet.date.start=2011-01-01T00:00:00 facet.date.end=2012-01-01T00:00:00 facet.date.gap=+1MONTH timezone=America/Chicago Currently you would need to do the following. (This actually gives close-to-correct results, but not exact. Again, the problem is that the gap of +1MONTH doesn't take daylight savings into account, so blog posts on the edge of ranges are counted in the wrong range.) facet.date=blogPostDate facet.date.start=2011-01-01T06:00:00Z facet.date.end=2012-01-01T06:00:00Z facet.date.gap=+1MONTH > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone day-light-savings changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > DateMathParser is created. All three are configured with the timezone being > UTC. 
If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone) its really only > essential that we can affect DateMathParser the SimpleFacets uses when > dealing with the gap of the date facets. > Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in sets the timezone on the Calendar it is using. > Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more then happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved and a decision needs > to be made on the syntax used > Thanks! > David
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076307#comment-13076307 ] Yonik Seeley commented on SOLR-2690: Although this probably isn't a "bug", I agree that handling timezones somehow would be nice. We just need to think very carefully about the API so we can support it long term. One immediate thought I had was that it would be a pain to specify the timezone everywhere. Even a simple range query would need to specify it twice: my_date:["(?timeZone=America/Chicago)NOW/YEAR" TO "(?timeZone=America/Chicago)+1MONTH"] So one possible alternative that needs more thought is a "TZ" request parameter that would apply by default to things that are date related. > Date Faceting or Range Faceting doesn't take timezone into account. > --- > > Key: SOLR-2690 > URL: https://issues.apache.org/jira/browse/SOLR-2690 > Project: Solr > Issue Type: Bug >Affects Versions: 3.3 >Reporter: David > Original Estimate: 3h > Remaining Estimate: 3h > > Timezone needs to be taken into account when doing date math. Currently it > isn't. DateMathParser instances created are always being constructed with > UTC. This is a huge issue when it comes to faceting. Depending on your > timezone day-light-savings changes the length of a month. A facet gap of > +1MONTH is different depending on the timezone and the time of the year. > I believe the issue is very simple to fix. There are three places in the code > DateMathParser is created. All three are configured with the timezone being > UTC. If a user could specify the TimeZone to pass into DateMathParser this > faceting issue would be resolved. > Though it would be nice if we could always specify the timezone > DateMathParser uses (since date math DOES depend on timezone) its really only > essential that we can affect DateMathParser the SimpleFacets uses when > dealing with the gap of the date facets. 
> Another solution is to expand the syntax of the expressions DateMathParser > understands. For example we could allow "(?timeZone=VALUE)" to be added > anywhere within an expression. VALUE would be the id of the timezone. When > DateMathParser reads this in sets the timezone on the Calendar it is using. > Two examples: > - "(?timeZone=America/Chicago)NOW/YEAR" > - "(?timeZone=America/Chicago)+1MONTH" > I would be more then happy to modify DateMathParser and provide a patch. I > just need a committer to agree this needs to be resolved and a decision needs > to be made on the syntax used > Thanks! > David
[jira] [Commented] (SOLR-2525) Date Faceting or Range Faceting with offset doesn't convert timezone
[ https://issues.apache.org/jira/browse/SOLR-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076303#comment-13076303 ] David commented on SOLR-2525: - I have opened a new ticket about this: SOLR-2690 > Date Faceting or Range Faceting with offset doesn't convert timezone > > > Key: SOLR-2525 > URL: https://issues.apache.org/jira/browse/SOLR-2525 > Project: Solr > Issue Type: Bug > Components: Schema and Analysis, search >Affects Versions: 3.1 > Environment: Solr 3.1 > Windows 2008 RC2 Server > Java 6 > Running on Jetty >Reporter: Rohit Gupta > Labels: date, facet > > I am trying to facet based on date field and apply user timezone offset so > that the faceted results are in user timezone. My faceted result is given > below, > > > > 0 > 6 > > true > icici >name="facet.range.start">2011-05-02T00:00:00Z+330MINUTES > createdOnGMTDate > 2011-05-18T00:00:00Z > +1DAY > > > > > > > 4 > 63 > 0 > 0 > .. > > +1DAY > 2011-05-02T05:30:00Z > 2011-05-18T05:30:00Z > > > > > Now if you notice that the response show 4 records for the 2th of May 2011 > which will fall in the IST timezone (+330MINUTES), but when I try to get the > results I see that there is only 1 result for the 2nd why is this happening. > > > > 0 > 5 > > createdOnGMTDate asc >name="fl">createdOnGMT,createdOnGMTDate,twtText >name="fq">createdOnGMTDate:[2011-05-01T00:00:00Z+330MINUTES TO *] > icici > > > > > Mon, 02 May 2011 16:27:05+ > 2011-05-02T16:27:05Z > #TechStrat615. Infosys (business soln & > IT > outsourcer) manages damages with new chairman > K.Kamath (ex ICICI > Bank chairman) to begin Aug 21. > > > Mon, 02 May 2011 19:00:44+ > 2011-05-02T19:00:44Z > how to get icici mobile banking > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. 
MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 08:52:37+ > 2011-05-03T08:52:37Z > RT @nice4ufan: ICICI BANK PERSONAL LOAN > http://ee4you.blogspot.com/2011/04/icici-bank-personal-loan.html >
[jira] [Created] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David Timezone needs to be taken into account when doing date math. Currently it isn't. DateMathParser instances created are always being constructed with UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved. Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of the date facets. Another solution is to expand the syntax of the expressions DateMathParser understands. For example, we could allow "(?timeZone=VALUE)" to be added anywhere within an expression. VALUE would be the id of the timezone. When DateMathParser reads this in, it sets the timezone on the Calendar it is using. Two examples: - "(?timeZone=America/Chicago)NOW/YEAR" - "(?timeZone=America/Chicago)+1MONTH" I would be more than happy to modify DateMathParser and provide a patch. I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used. Thanks! David
[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile
[ https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076301#comment-13076301 ] Shawn Heisey commented on SOLR-1972: Hoss, the patch isn't my work, I just modified it to support a 100th percentile and reattached it. I am only just now beginning to learn Java. Although I have some clue what you're saying with static methods, actually doing it properly within a larger work like Solr is something I won't be able to do yet. > Need additional query stats in admin interface - median, 95th and 99th > percentile > - > > Key: SOLR-1972 > URL: https://issues.apache.org/jira/browse/SOLR-1972 > Project: Solr > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Shawn Heisey >Priority: Minor > Attachments: SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, > SOLR-1972.patch, elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, > elyograg-1972-trunk.patch, elyograg-1972-trunk.patch > > > I would like to see more detailed query statistics from the admin GUI. This > is what you can get now: > requests : 809 > errors : 0 > timeouts : 0 > totalTime : 70053 > avgTimePerRequest : 86.59209 > avgRequestsPerSecond : 0.8148785 > I'd like to see more data on the time per request - median, 95th percentile, > 99th percentile, and any other statistical function that makes sense to > include. In my environment, the first bunch of queries after startup tend to > take several seconds each. I find that the average value tends to be useless > until it has several thousand queries under its belt and the caches are > thoroughly warmed. The statistical functions I have mentioned would quickly > eliminate the influence of those initial slow queries. > The system will have to store individual data about each query. I don't know > if this is something Solr does already. 
It would be nice to have a > configurable count of how many of the most recent data points are kept, to > control the amount of memory the feature uses. The default value could be > something like 1024 or 4096.
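The feature sketched in the issue -- keep only the most recent N request times and compute median/95th/99th percentiles over them -- can be illustrated with a small ring-buffer sketch. This is my own hypothetical illustration (class and method names are assumptions), not code from any SOLR-1972 patch:

```java
import java.util.Arrays;

// Hypothetical sketch: retain only the most recent N request times in a
// fixed-size ring buffer (bounding memory use, as the issue suggests) and
// compute nearest-rank percentiles on demand.
public class QueryTimeStats {
    private final long[] ring;
    private int count = 0, next = 0;

    QueryTimeStats(int capacity) {
        ring = new long[capacity];
    }

    void record(long millis) {
        ring[next] = millis;
        next = (next + 1) % ring.length;
        if (count < ring.length) count++;
    }

    // Nearest-rank percentile over the retained samples, p in (0, 100].
    long percentile(double p) {
        long[] sorted = Arrays.copyOf(ring, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * count); // 1-based nearest rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        QueryTimeStats stats = new QueryTimeStats(1024); // the suggested default size
        for (long t = 1; t <= 100; t++) stats.record(t); // 100 requests: 1..100 ms
        System.out.println(stats.percentile(50)); // 50
        System.out.println(stats.percentile(95)); // 95
        System.out.println(stats.percentile(99)); // 99
    }
}
```

Because old samples are overwritten once the buffer wraps, the slow warm-up queries mentioned in the issue age out of the statistics automatically, unlike a running average over all requests.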
[jira] [Commented] (SOLR-2525) Date Faceting or Range Faceting with offset doesn't convert timezone
[ https://issues.apache.org/jira/browse/SOLR-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076294#comment-13076294 ] David commented on SOLR-2525: - Timezone needs to be taken into account when doing date math. Currently it isn't. DateMathParser instances created are told to use UTC. This is a huge issue when it comes to faceting. Depending on your timezone day-light-savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code DateMathParser created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser this faceting issue would be resolved. > Date Faceting or Range Faceting with offset doesn't convert timezone > > > Key: SOLR-2525 > URL: https://issues.apache.org/jira/browse/SOLR-2525 > Project: Solr > Issue Type: Bug > Components: Schema and Analysis, search >Affects Versions: 3.1 > Environment: Solr 3.1 > Windows 2008 RC2 Server > Java 6 > Running on Jetty >Reporter: Rohit Gupta > Labels: date, facet > > I am trying to facet based on date field and apply user timezone offset so > that the faceted results are in user timezone. My faceted result is given > below, > > > > 0 > 6 > > true > icici >name="facet.range.start">2011-05-02T00:00:00Z+330MINUTES > createdOnGMTDate > 2011-05-18T00:00:00Z > +1DAY > > > > > > > 4 > 63 > 0 > 0 > .. > > +1DAY > 2011-05-02T05:30:00Z > 2011-05-18T05:30:00Z > > > > > Now if you notice that the response show 4 records for the 2th of May 2011 > which will fall in the IST timezone (+330MINUTES), but when I try to get the > results I see that there is only 1 result for the 2nd why is this happening. 
> > > > 0 > 5 > > createdOnGMTDate asc >name="fl">createdOnGMT,createdOnGMTDate,twtText >name="fq">createdOnGMTDate:[2011-05-01T00:00:00Z+330MINUTES TO *] > icici > > > > > Mon, 02 May 2011 16:27:05+ > 2011-05-02T16:27:05Z > #TechStrat615. Infosys (business soln & > IT > outsourcer) manages damages with new chairman > K.Kamath (ex ICICI > Bank chairman) to begin Aug 21. > > > Mon, 02 May 2011 19:00:44+ > 2011-05-02T19:00:44Z > how to get icici mobile banking > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 01:53:05+ > 2011-05-03T01:53:05Z > ICICI BANK LTD, L. M. MIRAJ branch in > SANGLI, > MAHARASHTRA. IFSC Code: ICIC0006537, MICR > Code: ... > http://bit.ly/fJCuWl #ifsc #micr #bank > > > Tue, 03 May 2011 08:52:37+ > 2011-05-03T08:52:37Z > RT @nice4ufan: ICICI BANK PERSONAL LOAN > http://ee4you.blogspot.com/2011/04/icici-bank-personal-loan.html >
[jira] [Resolved] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Hatcher resolved SOLR-1032. Resolution: Fixed Fix Version/s: 4.0 Assignee: Erik Hatcher > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Assignee: Erik Hatcher >Priority: Minor > Fix For: 4.0 > > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV.
[jira] [Created] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
!frange with query($qq) sets score=1.0f for all returned documents -- Key: SOLR-2689 URL: https://issues.apache.org/jira/browse/SOLR-2689 Project: Solr Issue Type: Bug Components: search Affects Versions: 3.4 Reporter: Markus Jelsma Fix For: 3.4, 4.0 Consider the following queries; both query the default field for 'test' and return the document digest and score (I don't seem to be able to get only the score; fl=score returns all fields). This is a normal query and yields normal results with proper scores: {code} q=test&fl=digest,score {code} {code} − 4.952673 c48e784f06a051d89f20b72194b0dcf0 − 4.952673 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 − 4.952673 0f7fefa6586ceda42fc1f095d460aa17 {code} This query uses frange with query() to limit the number of returned documents. When using multiple search terms I can indeed cut off the result set, but in the end all returned documents have score=1.0f. The final result set cannot be sorted by score anymore. The result set seems to be returned in the order of Lucene docIds. {code} q={!frange l=1.23}query($qq)&qq=test&fl=digest,score {code} {code} − 1.0 c48e784f06a051d89f20b72194b0dcf0 − 1.0 7f78a504b8cbd86c6cdbf2aa2c4ae5e3 − 1.0 0f7fefa6586ceda42fc1f095d460aa17 {code}
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076259#comment-13076259 ] Jan Høydahl commented on SOLR-1979: --- This has been tested on a real dataset of several hundred thousand docs, including HTML, office docs and multiple other formats, and it works well. I'd like some more pairs of eyes on this, however. One thing which is less than perfect is that the threshold conversion from Tika currently parses the (internal) distance value out of a String, for lack of a getDistance() method (TIKA-568). This is a bit of a hack, but I argue it's a beneficial one, since we can now configure langid.threshold to something meaningful for our own data instead of the preset binary isReasonablyCertain(). As we also normalize to a value between 0-1, we abstract away the Tika implementation detail, and are free to use any improved distance measures from Tika in the future, e.g. as a result of TIKA-369, or even plug in a non-Tika identifier or a hybrid solution. > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Minor > Labels: UpdateProcessor > Fix For: 3.4 > > Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, > SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch > > > Language identification from document fields, and mapping of field names to > language-specific fields based on detected language. > Wrap the Tika LanguageIdentifier in an UpdateProcessor. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
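The normalization described in the comment above (mapping a raw language-identifier distance onto a 0-1 certainty so that a configurable langid.threshold can be applied) could be sketched roughly as follows. This is an illustrative sketch only, not the SOLR-1979 patch: the method names and the assumption that a lower distance means a better match are mine.

```java
// Hypothetical sketch (not the SOLR-1979 code): normalize a raw
// language-identifier distance into a 0..1 certainty so a configurable
// threshold (like langid.threshold) can be applied instead of a preset
// binary isReasonablyCertain()-style decision.
public class LangIdThresholdSketch {
    // Assumes lower distance = better match; result clamped into [0, 1].
    static double normalizeDistance(double distance, double maxDistance) {
        double certainty = 1.0 - (distance / maxDistance);
        return Math.max(0.0, Math.min(1.0, certainty));
    }

    static boolean acceptLanguage(double distance, double maxDistance, double threshold) {
        return normalizeDistance(distance, maxDistance) >= threshold;
    }

    public static void main(String[] args) {
        // A close match (small distance) passes a 0.5 threshold,
        // a distant match does not.
        System.out.println(acceptLanguage(0.1, 1.0, 0.5));
        System.out.println(acceptLanguage(0.9, 1.0, 0.5));
    }
}
```

Because the value is normalized, the same threshold keeps its meaning even if the underlying identifier (Tika or otherwise) is swapped out later.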
[jira] [Updated] (SOLR-2454) Would like link in site navigation to the ManifoldCF project
[ https://issues.apache.org/jira/browse/SOLR-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated SOLR-2454: -- Attachment: SOLR-2454.patch Patch for site reference to ManifoldCF > Would like link in site navigation to the ManifoldCF project > > > Key: SOLR-2454 > URL: https://issues.apache.org/jira/browse/SOLR-2454 > Project: Solr > Issue Type: Improvement > Components: documentation >Reporter: Karl Wright >Priority: Minor > Attachments: SOLR-2454.patch > > > The Solr/Lucene site points to lots of other Apache projects. It would be > nice if it also pointed to ManifoldCF. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076247#comment-13076247 ] Simon Rosenthal commented on SOLR-1032: --- revised patch looks good - do commit. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076237#comment-13076237 ] Mark Miller commented on SOLR-2688: --- +1 - not only is it better in almost every way IMO, but it lets you avoid the very nasty IndexReader leak in the current index based API. > switch solr 4.0 example to DirectSpellChecker > - > > Key: SOLR-2688 > URL: https://issues.apache.org/jira/browse/SOLR-2688 > Project: Solr > Issue Type: Improvement > Components: spellchecker >Affects Versions: 4.0 >Reporter: Robert Muir > > For discussion: we might want to switch the Solr 4.0 example to use > DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076232#comment-13076232 ] Jan Høydahl commented on SOLR-1032: --- Nice. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076211#comment-13076211 ] Erik Hatcher commented on SOLR-1032: Simon - thanks for the effort on this! I have taken a look and updated the patch with a test case and a change to use _literal.field_name=value_ convention. I think for the sake of this feature, it's best to stick with the established Solr Cell convention. Perhaps in another issue we can take up refactoring parameter naming for this capability. Thoughts? Objections? I'll commit this to trunk once I hear Simon's signoff. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Hatcher updated SOLR-1032: --- Attachment: SOLR-1032.patch Attached is a patch adding a test case and switching to use the Solr Cell established convention of _literal.field_name=value_ parameter naming. > CSV loader to support literal field values > -- > > Key: SOLR-1032 > URL: https://issues.apache.org/jira/browse/SOLR-1032 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 1.3 >Reporter: Erik Hatcher >Priority: Minor > Attachments: SOLR-1032.patch, SOLR-1032.patch > > > It would be very handy if the CSV loader could handle a literal field > mapping, like the extracting request handler does. For example, in a > scenario where you have multiple datasources (some data from a DB, some from > file crawls, and some from CSV) it is nice to add a field to every document > that specifies the data source. This is easily done with DIH with a template > transformer, and Solr Cell with ext.literal.datasource=, but impossible > currently with CSV. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076201#comment-13076201 ] Michael McCandless commented on SOLR-2688: -- +1 > switch solr 4.0 example to DirectSpellChecker > - > > Key: SOLR-2688 > URL: https://issues.apache.org/jira/browse/SOLR-2688 > Project: Solr > Issue Type: Improvement > Components: spellchecker >Affects Versions: 4.0 >Reporter: Robert Muir > > For discussion: we might want to switch the Solr 4.0 example to use > DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076199#comment-13076199 ] Michael McCandless commented on LUCENE-3030: bq. One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum. +1 -- I think CA should serve up a TermsEnum when provided a Terms? > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek.
If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
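The core idea in the issue description above (variable-sized blocks cut along shared prefixes, rather than a new block every fixed N terms) can be illustrated with a deliberately simplified toy: grouping a sorted term list by a fixed-length leading prefix. This is not the actual BlockTreeTermsWriter, which chooses prefix depths adaptively; all names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch (not the real BlockTreeTermsWriter): cut a sorted term list
// into variable-sized blocks keyed by a shared prefix. The index above
// the blocks then only needs to hold the prefixes, and a lookup whose
// prefix is absent can fail fast without scanning any block.
public class PrefixBlockSketch {
    static Map<String, List<String>> groupByPrefix(List<String> sortedTerms, int prefixLen) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        for (String term : sortedTerms) {
            String prefix = term.substring(0, Math.min(prefixLen, term.length()));
            blocks.computeIfAbsent(prefix, k -> new ArrayList<>()).add(term);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "apply", "banana", "band", "bandana", "cat");
        // Blocks end up with different sizes, driven by how terms share prefixes.
        System.out.println(groupByPrefix(terms, 3));
    }
}
```

The real implementation differs in that the cut points come from freezing tails while building the FST, so block boundaries follow the term distribution instead of a fixed prefix length.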
[jira] [Created] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
switch solr 4.0 example to DirectSpellChecker - Key: SOLR-2688 URL: https://issues.apache.org/jira/browse/SOLR-2688 Project: Solr Issue Type: Improvement Components: spellchecker Affects Versions: 4.0 Reporter: Robert Muir For discussion: we might want to switch the Solr 4.0 example to use DirectSpellChecker, which doesn't need an extra index or build/rebuild'ing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076178#comment-13076178 ] Robert Muir commented on LUCENE-3030: - Also, we should measure if a "prefix automaton" with intersect() is faster than PrefixTermsEnum (I suspect it could be!) If this is true, we might want to not rewrite to prefixtermsenum anymore, instead changing PrefixQuery to extend AutomatonQuery too. > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek.
If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3030) Block tree terms dict & index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076177#comment-13076177 ] Robert Muir commented on LUCENE-3030: - This is awesome, I really like adding the intersect() hook! Thanks for making a branch, I will check it out and try to dive in to help with some of this :) One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum. This way, if it can be accomplished with a simpler enum like just terms.iterator() or prefixtermsenum etc., we get that optimization always. > Block tree terms dict & index > - > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, > LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on its leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076171#comment-13076171 ] Robert Muir commented on LUCENE-3220: - Hi David, I was thinking for the norm, we could store it like DefaultSimilarity. This would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) by using 1/sqrt we will get better quantization from SmallFloat? {noformat}
public byte computeNorm(FieldInvertState state) {
  final int numTerms;
  if (discountOverlaps)
    numTerms = state.getLength() - state.getNumOverlap();
  else
    numTerms = state.getLength();
  return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
}
{noformat} For computations, you have to 'undo' the sqrt() to get the quantized length, but that's OK since it's only done up front a single time and tableized, so it won't slow anything down. > Implement various ranking models as Similarities > > > Key: LUCENE-3220 > URL: https://issues.apache.org/jira/browse/LUCENE-3220 > Project: Lucene - Java > Issue Type: Sub-task > Components: core/query/scoring, core/search >Affects Versions: flexscoring branch >Reporter: David Mark Nemeskey >Assignee: David Mark Nemeskey > Labels: gsoc, gsoc2011 > Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, > LUCENE-3220.patch, LUCENE-3220.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we > can finally work on implementing the standard ranking models. Currently DFR, > BM25 and LM are on the menu.
> Done: > * {{EasyStats}}: contains all statistics that might be relevant for a > ranking algorithm > * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the > DocScorers and as much implementation detail as possible > * _BM25_: the current "mock" implementation might be OK > * _LM_ > * _DFR_ > * The so-called _Information-Based Models_ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
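The "undo the sqrt" step mentioned in the comment above can be shown with a tiny round-trip sketch. This omits Lucene's actual SmallFloat byte quantization and just illustrates the arithmetic: the norm stores 1/sqrt(length), and a similarity that needs the raw length recovers it once up front as 1/(norm*norm).

```java
// Illustrative sketch of the norm round trip (omits Lucene's SmallFloat
// byte encoding): store the norm as 1/sqrt(length), then "undo" the
// sqrt once, up front, to get the (quantized) document length back.
public class NormLengthSketch {
    static float encodeNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static int decodeLength(float norm) {
        // length ~= 1 / norm^2; in practice this would be tableized
        // once per possible byte value, so it costs nothing per hit.
        return Math.round(1.0f / (norm * norm));
    }

    public static void main(String[] args) {
        for (int len : new int[] {1, 16, 256}) {
            System.out.println(len + " -> " + decodeLength(encodeNorm(len)));
        }
    }
}
```

With the real byte encoding the recovered length is only approximate, but the decode table is still computed a single time, so scoring speed is unaffected.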
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076156#comment-13076156 ] Robert Muir commented on LUCENE-3335: - I don't think there is any sense in this, who cares? We reported this crash to Oracle in plenty of time, and the *worse* wrong-results bug has been open since May 13: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738, but Oracle decided not to fix that, too. > jrebug causes porter stemmer to sigsegv > --- > > Key: LUCENE-3335 > URL: https://issues.apache.org/jira/browse/LUCENE-3335 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, > 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, > 3.3, 3.4, 4.0 > Environment: - JDK 7 Preview Release, GA (may also affect update _1, > targeted fix is JDK 1.7.0_2) > - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts >Reporter: Robert Muir >Assignee: Robert Muir > Labels: Java7 > Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, > patch-0uwe.patch > > > happens easily on java7: ant test -Dtestcase=TestPorterStemFilter > -Dtests.iter=100 > might happen on 1.6.0_u26 too, a user reported something that looks like the > same bug already: > http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076132#comment-13076132 ] Uwe Schindler commented on LUCENE-3335: --- @Shay: Sorry I did not want to be too italian :-) I just wanted to ensure that such configurations, leading to bugs in JVMs, would be reported to us. It would help us to also respond quicker on such bug reports, like the one we already got 2 months ago (which nobody was able to reproduce, as we did not know that the user used aggressive opts). > jrebug causes porter stemmer to sigsegv > --- > > Key: LUCENE-3335 > URL: https://issues.apache.org/jira/browse/LUCENE-3335 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, > 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, > 3.3, 3.4, 4.0 > Environment: - JDK 7 Preview Release, GA (may also affect update _1, > targeted fix is JDK 1.7.0_2) > - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts >Reporter: Robert Muir >Assignee: Robert Muir > Labels: Java7 > Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, > patch-0uwe.patch > > > happens easily on java7: ant test -Dtestcase=TestPorterStemFilter > -Dtests.iter=100 > might happen on 1.6.0_u26 too, a user reported something that looks like the > same bug already: > http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2687) Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news.
Add new Solr book 'Apache Solr 3.1 Cookbook' to selection of Solr books and news. - Key: SOLR-2687 URL: https://issues.apache.org/jira/browse/SOLR-2687 Project: Solr Issue Type: Task Reporter: Julian Copes Find below the news item for the new Solr book; I can provide an image when prompted. I've included the URL for the new book, and the text is as follows: Rafał Kuć is proud to introduce a new book on Solr, "Apache Solr 3.1 Cookbook" from Packt Publishing. The Solr 3.1 Cookbook will make your everyday work easier by using real-life examples that show you how to deal with the most common problems that can arise while using the Apache Solr search engine. This cookbook will show you how to get the most out of your search engine. Each chapter covers a different aspect of working with Solr, from analyzing your text data through querying, performance improvement, and developing your own modules. The practical recipes will help you to quickly solve common problems with data analysis, show you how to use faceting to collect data, and speed up the performance of Solr. You will learn about functionalities that most newbies are unaware of, such as sorting results by a function value, highlighting matched words, and computing statistics to make your work with Solr easy and stress free. Click here to read more about the Apache Solr 3.1 Cookbook. (http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076114#comment-13076114 ] Dawid Weiss commented on LUCENE-3335: - Uwe has an Italian temper :) Btw. I really like the recent Yoda-discussion on concurrency-interest, Shay... > jrebug causes porter stemmer to sigsegv > --- > > Key: LUCENE-3335 > URL: https://issues.apache.org/jira/browse/LUCENE-3335 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, > 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, > 3.3, 3.4, 4.0 > Environment: - JDK 7 Preview Release, GA (may also affect update _1, > targeted fix is JDK 1.7.0_2) > - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts >Reporter: Robert Muir >Assignee: Robert Muir > Labels: Java7 > Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, > patch-0uwe.patch > > > happens easily on java7: ant test -Dtestcase=TestPorterStemFilter > -Dtests.iter=100 > might happen on 1.6.0_u26 too, a user reported something that looks like the > same bug already: > http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Labels: gsoc gsoc2011  (was: )

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>
> Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
> * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
> * integration tests, in which a small collection is indexed and then searched using the Similarities.
> Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3357:
----------------------------------------

    Labels: gsoc gsoc2011 test  (was: gsoc gsoc2011)

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>
> Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
> * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
> * integration tests, in which a small collection is indexed and then searched using the Similarities.
> Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3220:
----------------------------------------

    Component/s: core/query/scoring
         Labels: gsoc gsoc2011  (was: gsoc)

> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring, core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc, gsoc2011
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.
> Done:
> * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm
> * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
> * _BM25_: the current "mock" implementation might be OK
> * _LM_
> * _DFR_
> * The so-called _Information-Based Models_
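As a point of reference for the models on the menu above, the per-term BM25 contribution can be hand-calculated from the textbook formula. The sketch below is a minimal standalone Python version, not Lucene's implementation; the k1/b defaults and the exact idf smoothing are assumptions for illustration:

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; num_docs: collection size; doc_len / avg_doc_len: inputs to
    the document-length normalization.
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

This is also the kind of hand calculation that score-validation unit tests for the new Similarities could be checked against.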
[jira] [Created] (LUCENE-3357) Unit and integration test cases for the new Similarities
Unit and integration test cases for the new Similarities
--------------------------------------------------------

                 Key: LUCENE-3357
                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
             Project: Lucene - Java
          Issue Type: Sub-task
          Components: core/query/scoring
    Affects Versions: flexscoring branch
            Reporter: David Mark Nemeskey
            Assignee: David Mark Nemeskey
            Priority: Minor
             Fix For: flexscoring branch


Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created:
* unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations;
* integration tests, in which a small collection is indexed and then searched using the Similarities.
Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3220:
----------------------------------------

    Attachment: LUCENE-3220.patch

Added norm decoding table to EasySimilarity, and removed sumTotalFreq.

Sorry I could only upload this patch now, but I didn't have time to work on Lucene in the last week. As I see it, all the problems you mentioned have been corrected, so maybe we can go on with the review?

> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.
> Done:
> * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm
> * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
> * _BM25_: the current "mock" implementation might be OK
> * _LM_
> * _DFR_
> * The so-called _Information-Based Models_
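For context on the "norm decoding table" mentioned above: Lucene stores length norms as a single byte, and a similarity typically decodes them through a precomputed 256-entry float table. The Python sketch below illustrates the idea using the 3-mantissa-bit / 5-exponent-bit encoding of Lucene's SmallFloat.byte315ToFloat; treat it as an illustration of the scheme, not the code in this patch:

```python
import struct

def byte315_to_float(b):
    """Decode a byte-encoded small float: 3 mantissa bits, 5 exponent
    bits, exponent bias 15 (the SmallFloat "315" scheme)."""
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << 21      # place mantissa/exponent into IEEE 754 bit positions
    bits += (63 - 15) << 24      # re-bias the 5-bit exponent for a 32-bit float
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# Precomputing the whole table once makes per-document decoding a single
# array lookup instead of a bit-twiddling call per hit.
NORM_TABLE = [byte315_to_float(i) for i in range(256)]
```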
[jira] [Commented] (LUCENE-3343) Comparison operators >,>=,<,<= and = support as RangeQuery syntax in QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076091#comment-13076091 ]

Olivier Favre commented on LUCENE-3343:
---------------------------------------

Great, thanks! No blockers for 3x?

> Comparison operators >,>=,<,<= and = support as RangeQuery syntax in QueryParser
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3343
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/queryparser
>            Reporter: Olivier Favre
>            Assignee: Adriano Crestani
>            Priority: Minor
>              Labels: parser, query
>             Fix For: 3.4, 4.0
>
>         Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> To offer better interoperability with other search engines and to provide an easier and more straight forward syntax, the operators >, >=, <, <= and = should be available to express an open range query.
> They should at least work for numeric queries.
> '=' can be made a synonym for ':'.
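To make the proposed mapping concrete, here is a small sketch of how each comparison operator could be rewritten into range-query syntax, where [..] endpoints are inclusive and {..} endpoints are exclusive. The function name, the use of '*' for the open endpoint, and the mixed-bracket forms are illustrative assumptions, not the attached patch:

```python
def comparison_to_range(field, op, value):
    """Rewrite 'field op value' into range-query syntax.

    Hypothetical helper: '[' / ']' mark inclusive endpoints, '{' / '}'
    exclusive ones, and '*' stands for an open endpoint.
    """
    if op == '>':
        return '%s:{%s TO *]' % (field, value)   # strictly greater: exclusive lower bound
    if op == '>=':
        return '%s:[%s TO *]' % (field, value)
    if op == '<':
        return '%s:[* TO %s}' % (field, value)   # strictly less: exclusive upper bound
    if op == '<=':
        return '%s:[* TO %s]' % (field, value)
    if op == '=':
        return '%s:%s' % (field, value)          # '=' as a synonym for ':'
    raise ValueError('unsupported operator: %r' % op)
```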
[jira] [Created] (LUCENE-3356) trunk TestRollingUpdates.testRollingUpdates seed failure
trunk TestRollingUpdates.testRollingUpdates seed failure
--------------------------------------------------------

                 Key: LUCENE-3356
                 URL: https://issues.apache.org/jira/browse/LUCENE-3356
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: selckin


trunk r1152892
reproducible: always

{code}
junit-sequential:
    [junit] Testsuite: org.apache.lucene.index.TestRollingUpdates
    [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.168 sec
    [junit]
    [junit] - Standard Error -
    [junit] NOTE: reproduce with: ant test -Dtestcase=TestRollingUpdates -Dtestmethod=testRollingUpdates -Dtests.seed=-5322802004404580273:-4001225075726350391
    [junit] WARNING: test method: 'testRollingUpdates' left thread running: merge thread: _c(4.0):cv3/2 _h(4.0):cv3 into _k
    [junit] RESOURCE LEAK: test method: 'testRollingUpdates' left 1 thread(s) running
    [junit] NOTE: test params are: codec=RandomCodecProvider: {docid=Standard, body=SimpleText, title=MockSep, titleTokenized=Pulsing(freqCutoff=20), date=MockFixedIntBlock(blockSize=1474)}, locale=lv_LV, timezone=Pacific/Fiji
    [junit] NOTE: all tests run in this JVM:
    [junit] [TestRollingUpdates]
    [junit] NOTE: Linux 2.6.39-gentoo amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=8,threads=1,free=128782656,total=158400512
    [junit] - ---
    [junit] Testcase: testRollingUpdates(org.apache.lucene.index.TestRollingUpdates): FAILED
    [junit] expected:<20> but was:<21>
    [junit] junit.framework.AssertionFailedError: expected:<20> but was:<21>
    [junit] at org.apache.lucene.index.TestRollingUpdates.testRollingUpdates(TestRollingUpdates.java:76)
    [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
    [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.index.TestRollingUpdates FAILED
{code}