RE: Proposal Status, Initial Committers List, Contributors List

2011-01-08 Thread Digy
Thanks Troy,

It is very good.

DIGY

-Original Message-
From: Troy Howard [mailto:thowar...@gmail.com] 
Sent: Thursday, January 06, 2011 7:47 PM
To: lucene-net-dev@lucene.apache.org
Subject: Proposal Status, Initial Committers List, Contributors List

All,

Thanks for all the recent activity in the mailing lists. I'm really
eager to get this project moving forward and the discussions going on
now are exactly what we need to do that.

Calling attention back to the Incubator proposal, the outstanding
needs for completing that proposal are:
- Build Initial Committers list
- Ensure that Committers have all submitted a CLA
- Ensure that the proposal is in line with community interest

The current draft of the proposal is located at:

http://wiki.apache.org/incubator/Lucene.Net%20Proposal


For each of those goals:

Build Initial Committers List
---

I have updated the proposal to reflect the current state of the
Initial Committers list. The list, at present, is (alphabetical):

- Chris Currens
- DIGY
- Michael Herndon
- Prescott Nasser
- Scott Lombard
- Sergey Mirvoda
- Troy Howard

The only other person who has been discussed as a Committer but
hasn't formally stated interest in that role is Heath Aldrich (there
was some discussion, but no resolution). So, Heath, if you'd like to
be on the Initial Committers list, please send a quick message
indicating your interest.

Additionally, the following people have come forward as willing
Contributors (alphabetical):

 - Alex Thompson
 - Ben Martz
 - Frank Yu
 - Glyn Darkin
 - Peter Mateja
 - Shashi Kant
 - Simone Chiaretta
 - Wyatt Barnett


Ensure that Committers Have All Submitted a CLA
---

Everyone on the Initial Committers list will need to submit a CLA
before being granted commit access.
Currently the only person on that list who has submitted a CLA is
DIGY. I'll be sending mine in today, and I encourage the rest of you
to do so by the end of the week.

Information about the CLA, and how to submit it, is located here:

http://www.apache.org/licenses/#clas



Ensure That the Proposal Is In Line With Community Interest
--


So far, I have not heard any feedback from the community about the
text of the Incubator Proposal. Please review the current draft, and
if you have any reservations about the contents or language, or feel
that anything is missing or should be omitted, please post to the
mailing list expressing your concerns or ideas.

I will be submitting the proposal on Tuesday, January 11th, so please
review and discuss it prior to that. I want to make sure that everyone
who is affected by our proposal has had the opportunity to review it,
and either determine that they completely agree with it or discuss
their opinions openly prior to submission.

Again, the current draft of the proposal is located at:

http://wiki.apache.org/incubator/Lucene.Net%20Proposal


And, even though I sign most of my emails 'Thanks', I'd like to take a
second and express my sincere appreciation for the community we have
around this project, and for the effort and investment our
contributors have given and will continue to give. The project could
not survive without it.

Thanks,
Troy



RE: Proposal Status, Initial Committers List, Contributors List

2011-01-08 Thread srikalyan swayampakula

Hi Everyone,
  I would like to be a contributor for this project. Is
there any chance to become one, as I believe you have already decided on the
list of people?
 
Thanks,
~Sri.

RE: Proposal Status, Initial Committers List, Contributors List

2011-01-08 Thread Digy
I think there are some misunderstandings about how to contribute to
Lucene.Net.
Everyone is free to grab the source code, work on it, and post
bugs/improvements/new features. They are always welcome.

DIGY

-Original Message-
From: srikalyan swayampakula [mailto:srikalyansswa...@hotmail.com] 
Sent: Saturday, January 08, 2011 11:13 PM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Proposal Status, Initial Committers List, Contributors List


Hi Everyone,
  I would like to be a contributor for this project. Is
there any chance to become one, as I believe you have already decided on the
list of people?
 
Thanks,
~Sri.



LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright
This list might be interested to know that the current Solr LICENSE and NOTICE 
file contents are not Apache standard.  The ManifoldCF project based its 
LICENSE and NOTICE files on the Solr ones and got the following icy reception 
in the incubator:


The NOTICE file is still incorrect and includes a lot of unnecessary
stuff. Understanding how to do releases with the correct legal files
is one of the important parts of incubation, and as this is the first
release for the podling I think this needs to be sorted out.

For the NOTICE file, start with the following text (between the ---'s):

---
Apache ManifoldCF
Copyright 2010 The Apache Software Foundation

This product includes software developed by
The Apache Software Foundation (http://www.apache.org/).
---

and then add _nothing_ unless you can find explicit policy documented
somewhere in the ASF that says it is required. If someone wants to add
something, ask for the URL where the requirement is documented. The
NOTICE file should only include required notices; the other text that's
in the current NOTICE file could go in a README file, see
http://www.apache.org/legal/src-headers.html#notice

For the LICENSE file, it should start with the AL as the current one
does, and then include the text for all the other licenses used in the
distribution. The licenses that are currently in the NOTICE file
should be moved to the LICENSE file, and then you need to verify that
all the 3rd party dependencies in the src and binary distributions are
also in the LICENSE files of those distributions.



Our NOTICE includes the following, which was taken from Solr (because we have a
similar dependency).  I'd like to know whether it is a valid thing to include,
and where in Apache policy that is documented:


=========================================================================
==  Jetty Notice                                                       ==
=========================================================================
==============================================================
 Jetty Web Container 
 Copyright 1995-2006 Mort Bay Consulting Pty Ltd
==============================================================

This product includes some software developed at The Apache Software 
Foundation (http://www.apache.org/).

The javax.servlet package used by Jetty is copyright 
Sun Microsystems, Inc and Apache Software Foundation. It is 
distributed under the Common Development and Distribution License.
You can obtain a copy of the license at 
https://glassfish.dev.java.net/public/CDDLv1.0.html.

The UnixCrypt.java code implements the one way cryptography used by
Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
modified April 2001 by Iris Van den Broeke, Daniel Deville.

The default JSP implementation is provided by the Glassfish JSP engine
from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.

Some portions of the code are Copyright:
  2006 Tim Vernum 
  1999 Jason Gilbert.

The jboss integration module contains some LGPL code.

=========================================================================
==  HSQLDB Notice                                                      ==
=========================================================================

For content, code, and products originally developed by Thomas Mueller and the 
Hypersonic SQL Group:

Copyright (c) 1995-2000 by the Hypersonic SQL Group.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

Neither the name of the Hypersonic SQL Group nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE HYPERSONIC SQL GROUP,
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This software consists of voluntary 

Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/

2011-01-08 Thread Robert Muir
On Fri, Jan 7, 2011 at 10:47 PM,  hoss...@apache.org wrote:

 +  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();
 +

I don't know about this commit... I see a lot of EMPTY sets and maps
defined statically here.
There is no advantage to doing this; even the javadocs explain:
"Implementation note: Implementations of this method need not create a
separate (Set|Map|List) object for each call. Using this method is
likely to have comparable cost to using the like-named field. (Unlike
this method, the field does not provide type safety.)"

I think we should be using the Collection methods, for example on your
first file:

Index: solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java
===================================================================
--- solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java  (revision 1056691)
+++ solr/src/java/org/apache/solr/handler/AnalysisRequestHandlerBase.java  (working copy)
@@ -47,8 +47,6 @@
  */
 public abstract class AnalysisRequestHandlerBase extends RequestHandlerBase {
 
-  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();
-
   public void handleRequestBody(SolrQueryRequest req,
       SolrQueryResponse rsp) throws Exception {
     rsp.add("analysis", doAnalysis(req));
   }
@@ -343,7 +341,7 @@
    *
    */
   public AnalysisContext(String fieldName, FieldType fieldType,
       Analyzer analyzer) {
-    this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
+    this(fieldName, fieldType, analyzer, Collections.<String>emptySet());
   }
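
To illustrate the point (a hedged demo of my own, not part of the
patch): both calls below return the same shared immutable singleton,
so the per-class static constant saves no allocation and forfeits the
type safety of the generic method.

import java.util.Collections;
import java.util.Set;

public class EmptySetDemo {
  // The old pattern: one static constant per class.
  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();

  public static void main(String[] args) {
    // Collections.emptySet() hands back a shared immutable singleton,
    // so no new object is allocated per call.
    Set<String> a = Collections.emptySet();          // type inferred
    Set<String> b = Collections.<String>emptySet();  // explicit type witness
    System.out.println(a == b);                 // true: same instance
    System.out.println(a == EMPTY_STRING_SET);  // true: the field adds nothing
  }
}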

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

2011-01-08 Thread Earwin Burrfoot
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl /
 Cominvent jan@cominvent.com wrote:
 The problem with large start is probably worse when sharding is involved.
 Anyone know how the shard component goes about fetching
 start=1000000&rows=10 from say 10 shards? Does it have to merge sorted lists
 of 1mill+10 doc ids from each shard, which is the worst case?

 Yep, that's how it works today.


Technically, if your docs have a non-biased (in regards to their
sort-value) distribution across shards, you can fetch much less than
topN docs from each shard.
I played with the idea, and it worked for me. Though later I dropped
the opto, as it complicated things somewhat and my users aren't
querying gazillions of docs often.
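
For the curious, a hedged sketch of that estimate (my illustrative
code, not Solr's distributed-search component): if sort values are
spread uniformly, the number of global top-N docs landing on one of k
shards is binomial, so each shard only needs to return the mean plus a
few standard deviations of slack.

public class ShardFetchEstimate {
  // Expected per-shard fetch for a global top-N over 'shards' shards,
  // padded by 'm' standard deviations of the binomial distribution.
  static int perShardFetch(int topN, int shards, double m) {
    double p = 1.0 / shards;                    // chance a top doc is on this shard
    double mean = topN * p;                     // expected top docs per shard
    double sd = Math.sqrt(topN * p * (1 - p));  // binomial standard deviation
    return (int) Math.ceil(mean + m * sd);
  }

  public static void main(String[] args) {
    // start=1000000&rows=10 over 10 shards: a naive merge pulls
    // 1,000,010 doc ids per shard; the estimate needs roughly a tenth.
    System.out.println(perShardFetch(1000010, 10, 3.0)); // ~100,902
  }
}

If a shard happens to hold more of the top N than the margin allowed
for, you'd detect that at merge time and re-issue a deeper fetch, which
is presumably part of the complication that made it not worth keeping
for rarely-run deep queries.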


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LICENSE/NOTICE file contents

2011-01-08 Thread Robert Muir
You are probably right... the LICENSE.txt also contains many instances
of incorrect capitalization; I noticed that all versions of this
file I can find anywhere have this problem :)


[jira] Commented: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979118#action_12979118
 ] 

Michael McCandless commented on LUCENE-2831:


bq. It seems we also need to migrate FieldComparator to use ReaderContext 
(eventually AtomicReaderContext)?

+1

And also Collector?

 Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
 -

 Key: LUCENE-2831
 URL: https://issues.apache.org/jira/browse/LUCENE-2831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
 LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch


 Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
 boolean, boolean) we should / could revise the API and pass in a struct that 
 has parent reader, sub reader, ord of that sub. The ord mapping plus the 
 context with its parent would make several issues way easier. See 
 LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.
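
(For readers skimming the thread, a hedged sketch of the struct being
described; names are illustrative, and the real API is whatever the
attached patches define.)

{noformat}
// Illustrative only -- not the final Lucene API.
class ReaderContext {
  final ReaderContext parent; // null for the top-level reader's context
  final Object reader;        // the sub reader this context wraps (type elided)
  final int ord;              // ordinal of this sub within the parent

  ReaderContext(ReaderContext parent, Object reader, int ord) {
    this.parent = parent;
    this.reader = reader;
    this.ord = ord;
  }
}
{noformat}

Passing this into Weight#scorer and Filter#getDocIdSet gives
per-segment code the parent/ord mapping directly, instead of
re-deriving where the sub reader sits inside the top-level reader.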

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2854:
---

Attachment: LUCENE-2854.patch

I think we should simply make a hard break on the Sim.lengthNorm ->
computeNorm cutover.  Subclassing sim is an expert thing, and I'd
rather apps see a compilation error on upgrade so that they realize
their lengthNorm wasn't being called this whole time because of
LUCENE-2828 (and that they must now cutover to computeNorm).

So I made lengthNorm final (and throws UOE), computeNorm abstract.  I
deprecated SimilarityDelegator, and fixed BQ to not use it anymore.
The only other use is FuzzyLikeThisQuery, but fixing that is a little
too involved for today.
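
(A hedged sketch of that hard-break pattern, with simplified
signatures; the real 3.x Similarity class has many more methods.)

{noformat}
// Illustrative only.
class FieldInvertState { /* field length, boost, etc. elided */ }

abstract class Similarity {
  // Old hook: final and throwing, so a subclass that still overrides it
  // fails to compile and notices the cutover at upgrade time.
  public final float lengthNorm(String fieldName, int numTokens) {
    throw new UnsupportedOperationException("override computeNorm instead");
  }

  // New hook: abstract, so every subclass must implement it.
  public abstract float computeNorm(String field, FieldInvertState state);
}
{noformat}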


 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright
From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff.  
Yonik, do you remember why the HSQLDB and Jetty notice text was included in 
Solr's NOTICE.txt?  The incubator won't release ManifoldCF until we answer 
this question. ;-)

Karl



Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

2011-01-08 Thread Grant Ingersoll
The weird thing is, all of our collectors, IMO, are optimized for the
non-paging scenario, whereas I would venture to guess that the very large
majority of users out there do paging.  AFAICT, about the only people who
don't do paging are those who do deep, downstream analysis which requires
them to retrieve hundreds, thousands, or more results at a time (I've seen
as much as a million used in production) as part of a batch job.

See https://issues.apache.org/jira/browse/LUCENE-2215 and 
https://issues.apache.org/jira/browse/SOLR-1726 for the issues tracking this.

-Grant



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979128#action_12979128
 ] 

Michael McCandless commented on LUCENE-2854:


The above patch applies to 3.x

For trunk I plan to remove SimilarityDelegator from core, and move it 
(deprecated) into contrib/queries/... (private to FuzzyLikeThisQ).  At some 
point [later] we can fix FuzzyLikeThisQ to not use it...

 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979129#action_12979129
 ] 

Michael McCandless commented on LUCENE-2324:


bq. I guess we don't really need the global lock. A thread performing the 
global flush could still acquire each thread state before it starts flushing, 
but return a threadState to the pool once that particular threadState is done 
flushing?

Good question... we could (in theory) also flush them concurrently?  But, since 
we don't own the threads in IW, we can't easily do that, so I think no global 
lock, go through all DWPTs w/ current thread and flush, sequentially?  So all 
that's guaranteed after the global flush() returns is that all state present 
prior to when flush() is invoked, is moved to disk.  Ie if addDocs are still 
happening concurrently then the DWPTs will start filling up again even while 
the global flush runs.  That's fine.

{quote}

A related question is: Do we want to piggyback on multiple threads when a 
global flush happens? Eg. Thread 1 called commit, Thread 2 shortly afterwards 
addDocument(). When should addDocument() happen? 
a) After all DWPTs finished flushing? 
b) After at least one DWPT finished flushing and is available again?
c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1?

a) is currently implemented, but I think not really what we want.
b) is probably best for RT, because it means the lowest indexing latency for 
the new document to be added.
c) probably means the best overall throughput (depending even on hardware like 
disk speed, etc)
{quote}

I think start simple -- the addDocument always happens?  Ie it's never 
coordinated w/ the ongoing flush.  It picks a free DWPT like normal, and since 
flush is single threaded, there should always be a free DWPT?

Longer term c) would be great, or, if IW has an ES then it'd send multiple 
flush jobs to the ES.

{quote}
For whatever option we pick, we'll have to carefully think about error 
handling. It's quite straightforward for a) (just commit all flushed segments 
to SegmentInfos when the global flush completed successfully). But for b) and c) 
it's unclear what should happen if a DWPT flush fails after some completed 
already successfully before.
{quote}

I think we should continue what we do today?  Ie, if it's an 'aborting' 
exception, then the entire segment held by that DWPT is discarded?  And we then 
throw this exc back to caller (and don't try to flush any other segments)?
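
(To make the no-global-lock idea concrete, a hedged sketch with
illustrative names, not the actual realtime-branch code: the flushing
thread takes each DWPT in turn, so a concurrent addDocument only waits
if it is bound to the one DWPT currently being flushed.)

{noformat}
// Illustrative only.
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class DWPT {
  final ReentrantLock lock = new ReentrantLock();
  void flushSegment() { /* move this DWPT's RAM segment to disk */ }
}

class Flusher {
  // Sequential global flush: all state buffered before the call is on
  // disk when it returns; docs added concurrently land in DWPTs again
  // and are picked up by the next flush.
  static void globalFlush(List<DWPT> writers) {
    for (DWPT w : writers) {
      w.lock.lock();       // blocks only threads bound to this DWPT
      try {
        w.flushSegment();
      } finally {
        w.lock.unlock();   // DWPT is immediately reusable for indexing
      }
    }
  }
}
{noformat}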

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LICENSE/NOTICE file contents

2011-01-08 Thread Grant Ingersoll
Because they are shipped with Solr.  I don't see why it hurts to give people 
information about what's in the download.


Re: LICENSE/NOTICE file contents

2011-01-08 Thread Robert Muir
On Sat, Jan 8, 2011 at 10:06 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 There also wasn't any business about "and then add _nothing_ unless
 you can find explicit policy documented
 somewhere in the ASF that says it is required".  I was following
 examples from other projects and any docs I could find at the time,
 but this was back in '06.


Not sure there is now either, this is likely just someone's opinion.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979138#action_12979138
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}So all that's guaranteed after the global flush() returns is that all
state present prior to when flush() is invoked, is moved to disk. Ie if addDocs
are still happening concurrently then the DWPTs will start filling up again
even while the global flush runs. That's fine.{quote}

What if the user wants a guaranteed hard flush of all state up to the point of
the flush call (won't they want this sometimes with getReader)? If we're
flushing sequentially (without pausing all threads) we're removing that? Maybe
we'll need to give the option of global lock/stop or sequential flush?

Also I think we need to clear the thread bindings of a DWPT just prior to the
flush of the DWPT? Otherwise (when multiple threads are mapped to a single
DWPT) the other threads will wait on the [main] DWPT flush when they should be
spinning up a new DWPT? 

Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
recycling? Maybe we need to share and sync the byte[] pooling between DWPTs,
or will that noticeably affect indexing performance? 

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright


Nope - wasn't me that added the license stuff into NOTICE.txt ;-)
But, including Jetty's NOTICE seems appropriate for our NOTICE.  It's
just the license parts of the HSQLDB and SLF4J that should be moved to
LICENSE.txt


The NOTICE text is actually different from the LICENSE text for HSQLDB, which 
is why I thought it must have come from an HSQLDB NOTICE file.

Karl


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979139#action_12979139
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Also, don't we need the global lock for commit/close?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2854:


Attachment: LUCENE-2854_fuzzylikethis.patch

Here is the patch for fuzzylikethis for trunk... so you can remove the 
delegator completely in trunk.


 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979141#action_12979141
 ] 

Robert Muir commented on LUCENE-2854:
-

Is it possible to remove this method Query.getSimilarity also? I don't 
understand why we need this method!

{noformat}
  /** Expert: Returns the Similarity implementation to be used for this query.
   * Subclasses may override this method to specify their own Similarity
   * implementation, perhaps one that delegates through that of the Searcher.
   * By default the Searcher's Similarity implementation is returned.*/
{noformat}

 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 3511 - Failure

2011-01-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3511/

1 tests failed.
REGRESSION:  org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety

Error Message:
unable to create new native thread

Stack Trace:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:614)
at 
org.apache.lucene.search.TestThreadSafe.doTest(TestThreadSafe.java:133)
at 
org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety(TestThreadSafe.java:152)
at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:255)




Build Log (for compile errors):
[...truncated 8566 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version-5.patch

Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
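
(A hedged sketch of the core-aware cache in change #3; the names follow
the description above, but the body is illustrative, not the actual
patch.)

{noformat}
// Illustrative only.
import java.util.HashMap;
import java.util.Map;

class AEProvider { /* wraps a configured UIMA AnalysisEngine */ }

class AEProviderFactory {
  private static final AEProviderFactory INSTANCE = new AEProviderFactory();
  // One AEProvider per (core name, analysis engine path) pair.
  private final Map<String, AEProvider> cache = new HashMap<String, AEProvider>();

  static AEProviderFactory getInstance() { return INSTANCE; }

  // synchronized so concurrent update requests never race to build two
  // providers for the same core/path key.
  synchronized AEProvider getAEProvider(String coreName, String aePath) {
    String key = coreName + "@" + aePath;
    AEProvider provider = cache.get(key);
    if (provider == null) {
      provider = new AEProvider();
      cache.put(key, provider);
    }
    return provider;
  }
}
{noformat}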

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and a hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended by adding or selecting different UIMA analysis engines, both 
 from UIMA repositories on the web or creating new ones from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979144#action_12979144
 ] 

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:


Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

  was (Author: teofili):
Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
  
 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and a hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended by adding or selecting different UIMA analysis engines, either 
 from UIMA repositories on the web or created from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979144#action_12979144
 ] 

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:


Changes are:
- Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
- Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
- Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path.
- The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

  was (Author: teofili):
Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.
  
 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and a hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended by adding or selecting different UIMA analysis engines, either 
 from UIMA repositories on the web or created from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979146#action_12979146
 ] 

Michael McCandless commented on LUCENE-2324:


{quote}
What if the user wants a guaranteed hard flush of all state up to the point of
the flush call (won't they want this sometimes with getReader)? If we're
flushing sequentially (without pausing all threads) we're removing that? Maybe
we'll need to give the option of global lock/stop or sequential flush?
{quote}

What's a hard flush?

With the proposed approach, all docs added (or in the process of being added) 
will make it into the flushed segments once the flush returns; newly added docs 
after the flush call started may or may not make it.  But this is fine?  I mean, if 
the app has stronger requirements then it should externally sync?

bq. Also I think we need to clear the thread bindings of a DWPT just prior to 
the flush of the DWPT? 

Right.

As soon as a DWPT is pulled from production for flushing, it loses all thread 
affinity and becomes unavailable until its flush finishes.  When a thread needs 
a DWPT, it tries to pick the one it last had (affinity) but if that one's busy, 
it picks a new one.  If none are available but we are below our max DWPT count, 
it spins up a new one?
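
A rough sketch of that checkout policy (not Lucene code; all names here are 
placeholders):

{code}
import java.util.ArrayList;
import java.util.List;

class DWPTPoolSketch {
  static class DWPT {
    boolean busy; // true while indexing a doc or flushing
  }

  private final List<DWPT> pool = new ArrayList<DWPT>();
  private final int maxDWPTs;

  DWPTPoolSketch(int maxDWPTs) { this.maxDWPTs = maxDWPTs; }

  // 'last' is the DWPT this thread used before (its affinity); may be null
  synchronized DWPT checkout(DWPT last) {
    if (last != null && !last.busy) { // affinity hit
      last.busy = true;
      return last;
    }
    for (DWPT d : pool) {             // else any idle DWPT
      if (!d.busy) { d.busy = true; return d; }
    }
    if (pool.size() < maxDWPTs) {     // else spin up a new one below the max
      DWPT d = new DWPT();
      d.busy = true;
      pool.add(d);
      return d;
    }
    return null;                      // caller must wait and retry
  }

  synchronized void release(DWPT d) { d.busy = false; }
}
{code}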

{quote}
Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
recycling?
{quote}

Why would we lose them?  Wouldn't that DWPT just go back into rotation once the 
flush is done?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979149#action_12979149
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}As soon as a DWPT is pulled from production for flushing, it loses all 
thread affinity and becomes unavailable until its flush finishes. When a thread 
needs a DWPT, it tries to pick the one it last had (affinity) but if that one's 
busy, it picks a new one. If none are available but we are below our max DWPT 
count, it spins up a new one?{quote}

Right.

{quote}With the proposed approach, all docs added (or in the process of being 
added) will make it into the flushed segments once the flush returns; newly 
added docs after the flush call started may or may not make it. But this is fine? I 
mean, if the app has stronger requirements then it should externally 
sync?{quote}

Ok.  The proposed change is simply the thread calling add doc will flush its 
DWPT if needed, take it offline while doing so, and return it when completed.  
I think the risk is a new DWPT likely will have been created during flush, 
which'd make the returning DWPT inutile?

{quote}Why would we lose them? Wouldn't that DWPT just go back into rotation 
once the flush is done?{quote}

Yes, we just need to change the existing code a bit then.

However I think we may still need the global lock for close, eg, today we're 
preventing the user from adding docs during close, after this issue is merged 
that behavior would change?  

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1260:


Attachment: LUCENE-1260_defaultsim.patch

Here's a patch for the general case, and it also adds a warning
that you should set your similarity with Similarity.setDefault, especially if 
you omit norms.
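
For reference, a usage sketch of that 3.x-era global default; the custom 
computeNorm body is just an illustrative tweak, not what the patch does:

{code}
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

public class SetDefaultSimilarityExample {
  public static void main(String[] args) {
    Similarity.setDefault(new DefaultSimilarity() {
      @Override
      public float computeNorm(String field, FieldInvertState state) {
        return state.getBoost(); // example: ignore field length entirely
      }
    });
  }
}
{code}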

We can backport this to 3.x

The other cases involve fake norms, which I think we should completely remove 
in trunk with LUCENE-2846; then there is no longer an issue and we can remove 
the warning in trunk.


 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979160#action_12979160
 ] 

Uwe Schindler commented on LUCENE-1260:
---

bq. Here's a patch for the general case, and it also adds a warning that you 
should set your similarity with Similarity.setDefault, especially if you omit 
norms. 

Is there no way to remove this stupid static default and deprecate 
Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for 
the case of NormsWriter?

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979162#action_12979162
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

And there's the case where the thread calling flush doesn't yet have a DWPT; it's 
going to need to get one assigned to it, however the one assigned may not be 
the max RAM consumer.  What'll we do then?  If the user explicitly called flush 
we can a) do nothing, or b) flush the max-RAM-consumer thread's DWPT, however 
that gets hairy with wait/notifies (almost like the global lock?).

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979164#action_12979164
 ] 

Robert Muir commented on LUCENE-1260:
-

bq. Is there no way to remove this stupid static default and deprecate 
Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for 
the case of NormsWriter?

I think this is totally what we should try to do in trunk, especially after 
LUCENE-2846.

In this case, I want to fix the issue in a backwards-compatible way for Lucene 
3.x.
The warning is a little crazy I know; really, people shouldn't rely upon their 
encoder being used for *fake norms*.
But I think it's fair to document the corner case, just because it's not really 
fixable easily in 3.x.

For trunk, here is what i suggest:
* LUCENE-2846: remove all uses of fake norms. We never fill fake norms anymore 
at all, once we fix this issue. If you have a non-atomic reader with two 
segments, and one has no norms, then the whole norms[] should be null. This is 
consistent with omitTF. So, for example MultiNorms would never create fake
norms.
* LUCENE-2854: Mike is working on some issues I think where BooleanQuery uses 
this static or some other silliness with Similarity; I think we can clean that 
up there.
* finally at this point, I would like to remove 
Similarity.getDefault/setDefault altogether. I would prefer instead that 
IndexSearcher has a single 'DefaultSimilarity' that is the default value if you 
don't provide one, and likewise with IndexWriterConfig.


 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979174#action_12979174
 ] 

Michael McCandless commented on LUCENE-2854:


bq. Is it possible to remove this method Query.getSimilarity also? I don't 
understand why we need this method!

I would love to!  But I think that's for another day...

I looked into this and got stuck with BoostingQuery, which rewrites to an anon 
subclass of BQ overriding its getSimilarity to in turn override its coord method.  
Rather twisted... if we can do this differently I think we could remove 
Query.getSimilarity.
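
A from-memory paraphrase of that twist (not the exact contrib source): the 
rewrite builds an anonymous BooleanQuery whose getSimilarity() exists only to 
override coord(); the demotion factor below is a placeholder:

{code}
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Similarity;

public class BoostingQuerySketch {
  static BooleanQuery demotingQuery(final float demotionBoost) {
    return new BooleanQuery() {
      @Override
      public Similarity getSimilarity(Searcher searcher) {
        return new DefaultSimilarity() {
          @Override
          public float coord(int overlap, int maxOverlap) {
            // when every clause (match + context) overlaps, demote the score
            return overlap == maxOverlap ? demotionBoost : 1.0f;
          }
        };
      }
    };
  }
}
{code}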

 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2828:
---

Fix Version/s: 3.0.4
   2.9.5

 SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
 --

 Key: LUCENE-2828
 URL: https://issues.apache.org/jira/browse/LUCENE-2828
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3
Reporter: Michael McCandless
 Fix For: 2.9.5, 3.0.4

 Attachments: LUCENE-2828.patch


 In LUCENE-1420, we added Similarity.computeNorm to let the norm computation 
 have access to the raw information (length, boost, etc.).
 But this class broke back compat with SimilarityDelegator.  We did add 
 computeNorm there, but its impl just forwards to the delegee's computeNorm. 
  In the case where a subclass of SimilarityDelegator overrides lengthNorm, 
 that method will no longer be invoked.
 Not quite sure how to fix this since, somehow, we have to determine whether 
 the delegee's impl of computeNorm should be favored over the subclass's impl 
 of the legacy lengthNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979178#action_12979178
 ] 

Michael McCandless commented on LUCENE-2828:


We won't fix this for 3.x or 4.0, since we've deprecated SimilarityDelegator, 
and forced hard cutover from Sim.lengthNorm -> Sim.computeNorm (LUCENE-2854).

But I'll leave this open in case we do another 2.9/3.0 release.

 SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
 --

 Key: LUCENE-2828
 URL: https://issues.apache.org/jira/browse/LUCENE-2828
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3
Reporter: Michael McCandless
 Fix For: 2.9.5, 3.0.4

 Attachments: LUCENE-2828.patch


 In LUCENE-1420, we added Similarity.computeNorm to let the norm computation 
 have access to the raw information (length, boost, etc.).
 But this class broke back compat with SimilarityDelegator.  We did add 
 computeNorm there, but its impl just forwards to the delegee's computeNorm. 
  In the case where a subclass of SimilarityDelegator overrides lengthNorm, 
 that method will no longer be invoked.
 Not quite sure how to fix this since, somehow, we have to determine whether 
 the delegee's impl of computeNorm should be favored over the subclass's impl 
 of the legacy lengthNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2854.


Resolution: Fixed

 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/

2011-01-08 Thread Chris Hostetter

:  +  public static final Set<String> EMPTY_STRING_SET = Collections.emptySet();
:  +
: 
: I don't know about this commit... i see a lot of EMPTY sets and maps
: defined statically here.
...
: I think we should be using the Collection methods, for example on your
: first file:

Hmmm... I am using the Collections method; it's the same set/map in each 
case, I'm just creating static refs to them with the type information.  

My reading of the javadocs was that the implementation of emptySet() was 
going to just return the same immutable instance every time anyway, so 
there didn't seem to be any functional diff in reusing it like this -- it 
seemed like the natural way to migrate from using Collections.EMPTY_SET: 
use our own local ref of the same object w/type info.

: -  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
: +  this(fieldName, fieldType, analyzer, Collections.<String>emptySet());

Ah... see, I didn't even know that syntax was valid to bind the generic on 
a static method.  I'd only ever done the binding in the assignment.  
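
For the record, a tiny self-contained illustration of the equivalence (with 
the stock JDK, both forms hand back the same shared immutable singleton):

    import java.util.Collections;
    import java.util.Set;

    public class EmptySetExample {
      // a typed static ref merely re-labels the shared immutable instance...
      static final Set<String> EMPTY_STRING_SET = Collections.emptySet();

      public static void main(String[] args) {
        // ...so binding the generic at the call site is equivalent:
        Set<String> s = Collections.<String>emptySet();
        System.out.println(s == EMPTY_STRING_SET); // true: same singleton
      }
    }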

Yeah, sure -- I'll make a note to myself to go back and clean those up.

-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (SOLR-2288) clean up compiler warnings

2011-01-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979188#action_12979188
 ] 

Hoss Man commented on SOLR-2288:


Reminder to self: feedback from rmuir on the mailing list to replace the static 
EMPTY set/map refs w/type info that I added with direct usage like this...

-  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
+  this(fieldName, fieldType, analyzer, Collections.<String>emptySet());


 clean up compiler warnings
 --

 Key: SOLR-2288
 URL: https://issues.apache.org/jira/browse/SOLR-2288
 Project: Solr
  Issue Type: Improvement
Reporter: Hoss Man
Assignee: Hoss Man
 Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch


 there's a ton of compiler warnings in the solr tree, and it's high time we 
 cleaned them up, or annotated them to be suppressed, so we can start making a 
 bigger stink when/if code is added to the tree that produces warnings (we'll 
 never do a good job of noticing new warnings when we have ~175 existing ones).
 Using this issue to track related commits.
 The goal of this issue should not be to change any functionality or APIs, 
 just to deal with each warning in the most appropriate way:
 * fix generic declarations
 * add a SuppressWarnings annotation if it's safe in context (see the sketch below)
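
 A sketch of the two cleanup styles (illustrative code, not from any actual commit):

{code}
import java.util.ArrayList;
import java.util.List;

public class WarningCleanupExample {
  // before: List names = new ArrayList();  // raw-type warning
  // after: fix the generic declaration
  List<String> names = new ArrayList<String>();

  @SuppressWarnings("unchecked") // safe: the raw array never escapes untyped
  static <T> List<T>[] newListArray(int n) {
    return (List<T>[]) new List[n]; // unavoidable generic-array creation cast
  }
}
{code}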

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979190#action_12979190
 ] 

Michael McCandless commented on LUCENE-2324:


{quote}
And there's the case of the thread calling flush doesn't yet have a DWPT, it's 
going to need to get one assigned to it, however the one assigned may not be 
the max ram consumer. What'll we do then? If the user explicitly called flush 
we can a) do nothing b) flush (the max ram consumer) thread's DWPT, however 
that gets hairy with wait notifies (almost like the global lock?).
{quote}

Wait -- why would the thread calling flush need to have a DWPT assigned to it?  
You're talking about the "flush the world" case?  (Ie the app calls IW.commit 
or IW.getReader).  In this case the thread just one by one pulls all DWPTs that 
have any indexed docs out of production, flushes them, clears them, and returns 
them to production?
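
In sketch form (not Lucene code; the DWPT methods here are hypothetical):

{code}
import java.io.IOException;
import java.util.List;

class FlushTheWorldSketch {
  // minimal stand-in for the per-thread writer; all methods hypothetical
  interface DWPT {
    int getNumDocsInRAM();
    void takeOutOfProduction(); // mark busy, drop thread affinity
    void flushSegment() throws IOException;
    void clear();
    void returnToProduction();  // mark idle again
  }

  // the committing thread walks the pool one DWPT at a time
  synchronized void flushAllThreads(List<DWPT> pool) throws IOException {
    for (DWPT dwpt : pool) {
      if (dwpt.getNumDocsInRAM() == 0) continue; // nothing buffered
      dwpt.takeOutOfProduction();
      try {
        dwpt.flushSegment();
        dwpt.clear();
      } finally {
        dwpt.returnToProduction();
      }
    }
  }
}
{code}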

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979189#action_12979189
 ] 

Michael McCandless commented on LUCENE-2324:


bq. The proposed change is simply the thread calling add doc will flush its 
DWPT if needed, take it offline while doing so, and return it when completed.

Wait -- this is the addDocument case right?  (I thought we were still talking 
about the "flush the world" case...).

bq.  I think the risk is a new DWPT likely will have been created during flush, 
which'd make the returning DWPT inutile?

A new DWPT will have been created only if more than one thread is indexing docs 
right?  In which case this is fine?  Ie the old DWPT (just flushed) will just 
go back into rotation, and when another thread comes in it can take it?

But, you're right: maybe we should sometimes prune DWPTs.  Or simply stop 
recycling any RAM, so that a just-flushed DWPT is an empty shell.

bq. However I think we may still need the global lock for close, eg, today 
we're preventing the user from adding docs during close, after this issue is 
merged that behavior would change?

Well, the threads still adding docs will hit AlreadyClosedException?  (But, 
that's just best effort).  The behavior of calling IW.close while other 
threads are still adding docs has never been defined (and, shouldn't be) except 
that we won't corrupt your index, and we'll get all docs indexed before .close 
was called, committed.  So I think even for this case we don't need a global 
lock.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2829) improve termquery pk lookup performance

2011-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2829:
---

Attachment: LUCENE-2829.patch

New patch.  I added VirtualMethods to Sim to make sure that, for Sim subclasses 
that don't override the idfExplain that takes docFreq, the legacy method is 
still called.
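
Roughly, the VirtualMethod mechanism works like this (illustration only, not 
the patch; the exact idfExplain parameter list is an assumption):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.util.VirtualMethod;

public class IdfExplainCompatSketch {
  // records which Similarity subclasses override the docFreq-taking variant
  private static final VirtualMethod<Similarity> newIdfExplain =
      new VirtualMethod<Similarity>(Similarity.class, "idfExplain",
          Term.class, Searcher.class, int.class);

  // callers can fall back to the legacy idfExplain when this returns false
  static boolean overridesNewIdfExplain(Class<? extends Similarity> clazz) {
    return newIdfExplain.isOverriddenAsOf(clazz);
  }
}
{code}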

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2829.patch, LUCENE-2829.patch, LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-236) Field collapsing

2011-01-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979193#action_12979193
 ] 

Samuel García Martínez commented on SOLR-236:
-

The NPE noticed by Shekhar Nirkhe is caused by some errors in the filter query 
cache and the signature key that is used to store cached results. 

To sum up, if you perform a filter query and then perform that same query using 
a collapse field, the query result is already cached, but not cached as expected 
by this component. As a result, the DocSet implementation is not the expected 
one and, since the result comes from the cache, the DocumentCollector is never 
executed.

As soon as I can, I'll post a patch that caches results using a combined key 
formed by the collector class and the query itself.
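
A sketch of such a combined key (names are placeholders, not the forthcoming 
patch):

{code}
import org.apache.lucene.search.Query;

public class CollapseCacheKey {
  private final Class<?> collectorClass;
  private final Query query;

  public CollapseCacheKey(Class<?> collectorClass, Query query) {
    this.collectorClass = collectorClass;
    this.query = query;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof CollapseCacheKey)) return false;
    CollapseCacheKey other = (CollapseCacheKey) o;
    return collectorClass.equals(other.collectorClass)
        && query.equals(other.query);
  }

  @Override
  public int hashCode() {
    return 31 * collectorClass.hashCode() + query.hashCode();
  }
}
{code}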

Colbenson - Findability Experts 
http://www.colbenson.es/



 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: Next

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
 field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
 quasidistributed.additional.patch, 
 SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, 
 SOLR-236-distinctFacet.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
 SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 Used in order to collapse a group of results with a similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also Duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979229#action_12979229
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}the flush the world case? (Ie the app calls IW.commit or
IW.getReader). In this case the thread just one by one pulls all DWPTs that
have any indexed docs out of production, flushes them, clears them, and returns
them to production?{quote}

The 2 cases are: A) flush every DWPT sequentially (aka "flush the world") and 
B) flush by RAM usage when adding docs or deleting. A is clear! I think with B
we're saying even if the calling thread is bound to DWPT #1, if DWPT #2 is
greater in size and the aggregate RAM usage exceeds the max, using the calling
thread, we take DWPT #2 out of production, flush, and return it?

{quote}The behavior of calling IW.close while other threads are still adding
docs has never been defined (and, shouldn't be) except that we won't corrupt
your index, and we'll get all docs indexed before .close was called, committed.
So I think even for this case we don't need a global lock.{quote}

Great, that simplifies and clarifies that we do not require a global lock.

{quote}But, you're right: maybe we should sometimes prune DWPTs. Or simply
stop recycling any RAM, so that a just-flushed DWPT is an empty shell.{quote}

I'm not sure how we'd prune; typically object pools have a separate eviction
thread, and I think that's going overboard? Maybe we can simply throw out the
DWPT and put recycling byte[]s and/or pooling DWPTs back in later if necessary?



 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979243#action_12979243
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

To further clarify, we also no longer have global aborts?  Each abort only 
applies to an individual DWPT?  

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979247#action_12979247
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. I think the risk is a new DWPT likely will have been created during flush, 
which'd make the returning DWPT inutile.

The DWPT will not be removed from the pool, just marked as busy during flush, 
just as its state is busy (or currently called non-idle in the code) during 
addDocument().  So no new DWPT would be created during flush if the 
maxThreadState limit was already reached.




 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979248#action_12979248
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
I think start simple - the addDocument always happens? Ie it's never 
coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since 
flush is single threaded, there should always be a free DWPT?
{quote}

Yeah I agree.  The change I'll make then is to not have the global lock, and to 
return a DWPT immediately to the pool and set it to 'idle' after its flush 
completes.

{quote}
I think we should continue what we do today? Ie, if it's an 'aborting' 
exception, then the entire segment held by that DWPT is discarded? And we then 
throw this exc back to caller (and don't try to flush any other segments)?
{quote}

What I meant was the following situation: Suppose we have two DWPTs and 
IW.commit() is called.  The first DWPT finishes flushing successfully, is 
returned to the pool and idle again.  The second DWPT flush fails with an 
aborting exception.  Should the segment of the first DWPT make it into the 
index or not?  I think segment 1 shouldn't be committed, ie. a global flush 
should be all or nothing.  This means we would have to delay the commit of the 
segments until all DWPTs flushed successfully.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Adriano Crestani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adriano Crestani updated LUCENE-2855:
-

Attachment: lucene_2855_adriano_crestani_2011_01_08.patch

Here is the fix for the problem raised in thread [1]. The patch also includes a 
JUnit test to make sure the problem doesn't show up again.

If there are no concerns in two days, I will go ahead and commit the patch.

[1] - http://lucene.markmail.org/thread/mbb5wlxttsa6sges

 Contrib queryparser should not use CharSequence as Map key
 --

 Key: LUCENE-2855
 URL: https://issues.apache.org/jira/browse/LUCENE-2855
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0.3
Reporter: Adriano Crestani
Assignee: Adriano Crestani
 Fix For: 3.0.4

 Attachments: lucene_2855_adriano_crestani_2011_01_08.patch


 Today, contrib query parser uses Map<CharSequence,...> in many different 
 places, which may lead to problems, since the CharSequence interface does not 
 enforce the implementation of hashCode and equals methods. Today, it's 
 causing a problem with the QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder) 
 method, which does not work as expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979252#action_12979252
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}I think segment 1 shouldn't be committed, ie. a global flush should be 
all or nothing. This means we would have to delay the commit of the segments 
until all DWPTs flushed successfully.{quote}

If a DWPT aborts during flush, we simply throw an exception, however we still 
keep the successfully flushed segment(s).  If there's an abort on any DWPT 
during commit then we throw away any successfully flushed segments as well.  I 
think that makes sense, eg, all or nothing.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979253#action_12979253
 ] 

Uwe Schindler commented on LUCENE-2855:
---

+1 to commit.

In general, one should never use interfaces as keys in maps (as long as they 
don't declare the equals and hashCode methods inside the interface).
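
A tiny demonstration of the pitfall: two CharSequences with identical content 
need not be equal, so one cannot retrieve the other's entry:

{code}
import java.util.HashMap;
import java.util.Map;

public class CharSequenceKeyPitfall {
  public static void main(String[] args) {
    Map<CharSequence, String> map = new HashMap<CharSequence, String>();
    map.put(new StringBuilder("field"), "builder"); // identity-based equals
    System.out.println(map.get("field"));           // null, not "builder"
    map.put("field", "string");
    System.out.println(map.get(new StringBuilder("field"))); // null again
  }
}
{code}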

 Contrib queryparser should not use CharSequence as Map key
 --

 Key: LUCENE-2855
 URL: https://issues.apache.org/jira/browse/LUCENE-2855
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0.3
Reporter: Adriano Crestani
Assignee: Adriano Crestani
 Fix For: 3.0.4

 Attachments: lucene_2855_adriano_crestani_2011_01_08.patch


 Today, contrib query parser uses Map<CharSequence,...> in many different 
 places, which may lead to problems, since the CharSequence interface does not 
 enforce the implementation of hashCode and equals methods. Today, it's 
 causing a problem with the QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder) 
 method, which does not work as expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979255#action_12979255
 ] 

Uwe Schindler commented on LUCENE-2855:
---

One thing in your patch: Lucene tests should always extend LuceneTestCase 
(which is JUnit 4).

 Contrib queryparser should not use CharSequence as Map key
 --

 Key: LUCENE-2855
 URL: https://issues.apache.org/jira/browse/LUCENE-2855
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0.3
Reporter: Adriano Crestani
Assignee: Adriano Crestani
 Fix For: 3.0.4

 Attachments: lucene_2855_adriano_crestani_2011_01_08.patch


 Today, contrib query parser uses Map<CharSequence,...> in many different 
 places, which may lead to problems, since the CharSequence interface does not 
 enforce the implementation of hashCode and equals methods. Today, it's 
 causing a problem with the QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder) 
 method, which does not work as expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979264#action_12979264
 ] 

Lance Norskog commented on SOLR-2129:
-

bq. Don't want to at least log this? } catch (AnalysisEngineProcessException 
e) { // do nothing }

bq. I wanted the UIMA enrichment pipeline to be error safe but I agree it'd be 
reasonable to log the error in this case (even if I don't like logging 
exceptions in general).

Please do not hide errors in any way. Nobody reads logs. If it fails in 
production, I want to know immediately and fix it. Please just throw all 
exceptions up the stack.
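
In code, the behavior being asked for looks roughly like this (the catch site 
and names are illustrative, not the actual processor code):

{code}
try {
  analysisEngine.process(cas);
} catch (AnalysisEngineProcessException e) {
  // propagate so a production failure is visible immediately,
  // rather than silently dropping the UIMA metadata
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
      "UIMA analysis failed", e);
}
{code}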

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and a hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended by adding or selecting different UIMA analysis engines, either 
 from UIMA repositories on the web or created from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 3533 - Failure

2011-01-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3533/

1 tests failed.
FAILED:  org.apache.lucene.util.TestVersion.testFilter

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not 
reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please 
note the time in the report does not reflect the time until the VM exit.
at java.lang.Thread.run(Thread.java:636)




Build Log (for compile errors):
[...truncated 8470 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org