date:20110108


[ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979118#action_12979118
 ] 

Michael McCandless commented on LUCENE-2831:


bq. It seems we also need to migrate FieldComparator to use ReaderContext 
(eventually AtomicReaderContext)?

+1

And also Collector?

 Revise Weight#scorer  Filter#getDocIdSet API to pass Readers context
 -

 Key: LUCENE-2831
 URL: https://issues.apache.org/jira/browse/LUCENE-2831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
 LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch


 Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
 boolean, boolean) we should / could revise the API and pass in a struct that 
 has parent reader, sub reader, ord of that sub. The ord mapping plus the 
 context with its parent would make several issues way easier. See 
 LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

[
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless updated LUCENE-2854:
---

Attachment: LUCENE-2854.patch

I think we should simply make a hard break on the Sim.lengthNorm -
computeNorm cutover. Subclassing sim is an expert thing, and, I'd
rather apps see a compilation error on upgrade so that they realize
their lengthNorm wasn't being called this whole time because of
LUCENE-2828 (and that they must now cutover to computeNorm).

So I made lengthNorm final (and throws UOE), computeNorm abstract. I
deprecated SimilarityDelegator, and fixed BQ to not use it anymore.
The only other use is FuzzyLikeThisQuery, but fixing that is a little
too involved for today.

Deprecate SimilarityDelegator and Similarity.lengthNorm
---

Key: LUCENE-2854
URL: https://issues.apache.org/jira/browse/LUCENE-2854
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 3.1, 4.0

Attachments: LUCENE-2854.patch

SimilarityDelegator is a back compat trap (see LUCENE-2828).
Apps should just [statically] subclass Sim or DefaultSim; if they really need
runtime subclassing then they can make their own app-level delegator.
Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm
in favor of computeNorm.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

RE: LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright

From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff.  
Yonik, do you remember why the HSQLDB and Jetty notice text was included in 
Solr's NOTICE.txt?  The incubator won't release ManifoldCF until we answer 
this question. ;-)

Karl


From: ext Robert Muir [rcm...@gmail.com]
Sent: Saturday, January 08, 2011 7:11 AM
To: dev@lucene.apache.org
Subject: Re: LICENSE/NOTICE file contents

You are probably right... the LICENSE.txt also contains many instances
of incorrect capitalization, I noticed that all versions of of this
file I can find anywhere have this problem :)

On Sat, Jan 8, 2011 at 6:14 AM,  karl.wri...@nokia.com wrote:
 This list might be interested to know that the current Solr LICENSE and 
 NOTICE file contents are not Apache standard.  The ManifoldCF project based 
 its LICENSE and NOTICE files on the Solr ones and got the following icy 
 reception in the incubator:


 The NOTICE file is still incorrect and includes a lot of unnecessary
 stuff. Understanding how to do releases with the correct legal files
 is one of the important parts of incubation and as this is the first
 release for the poddling i think this needs to be sorted out.

 For the NOTICE file, start with the following text (between the ---'s):

 ---
 Apache ManifestCF
 Copyright 2010 The Apache Software Foundation

 This product includes software developed by
 The Apache Software Foundation (http://www.apache.org/).
 ---

 and then add _nothing_ unless you can find explicit policy documented
 somewhere in the ASF that says it is required. If someone wants to add
 something ask for the URL where the requirement is documented. The
 NOTICE file should only include required notices, the other text thats
 in the current NOTICE file could go in a README file, see
 http://www.apache.org/legal/src-headers.html#notice

 For the LICENSE file, it should start with the AL as the current one
 does, and then include the text for all the other licenses used in the
 distribution. Those license that are currently in the NOTICE file
 should be moved to the LICENSE file and then you need to verify that
 all the 3rd party dependencies in the src and binary distributions are
 also in the LICENSE files of those distributions.

 

 Our NOTICE includes the following, which was taken from Solr (because we have 
 a similar dependency).  I'd like to know whether it is a valid thing to 
 include, and where it says that somewhere in Apache:


 =
 == Jetty Notice==
 =
 ==
  Jetty Web Container
  Copyright 1995-2006 Mort Bay Consulting Pty Ltd
 ==

 This product includes some software developed at The Apache Software
 Foundation (http://www.apache.org/).

 The javax.servlet package used by Jetty is copyright
 Sun Microsystems, Inc and Apache Software Foundation. It is
 distributed under the Common Development and Distribution License.
 You can obtain a copy of the license at
 https://glassfish.dev.java.net/public/CDDLv1.0.html.

 The UnixCrypt.java code ~Implements the one way cryptography used by
 Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
 modified April 2001  by Iris Van den Broeke, Daniel Deville.

 The default JSP implementation is provided by the Glassfish JSP engine
 from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
 Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.

 Some portions of the code are Copyright:
  2006 Tim Vernum
  1999 Jason Gilbert.

 The jboss integration module contains some LGPL code.

 =
 == HSQLDB Notice   ==
 =

 For content, code, and products originally developed by Thomas Mueller and 
 the Hypersonic SQL Group:

 Copyright (c) 1995-2000 by the Hypersonic SQL Group.
 All rights reserved.

 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:

 Redistributions of source code must retain the above copyright notice, this
 list of conditions and the following disclaimer.

 Redistributions in binary form must reproduce the above copyright notice,
 this list of conditions and the following disclaimer in the documentation
 and/or other materials provided with the distribution.

 Neither the name of the Hypersonic SQL Group nor the names of its
 contributors may be used to endorse or promote products derived from this
 software without specific prior written permission.

 THIS SOFTWARE IS PROVIDED BY THE

Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

2011-01-08 Thread Grant Ingersoll

The weird thing is, all of our collectors, IMO, are optimized for the 
non-paging scenario, when I would venture to guess that the very large majority 
of users out there do paging.  AFAICT, about the only people who don't do 
paging are those who do deep, downstream analysis which requires them to 
retrieve 100's or 1000's or more of results at a time (I've seen as much as a 
million used in production) as part of a batch job.

See https://issues.apache.org/jira/browse/LUCENE-2215 and 
https://issues.apache.org/jira/browse/SOLR-1726 for the issues tracking this.

-Grant

On Jan 8, 2011, at 7:11 AM, Earwin Burrfoot wrote:

 On Mon, Jan 3, 2011 at 18:18, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / 
 Cominventjan@cominvent.com wrote:
 The problem with large start is probably worse when sharding is involved. 
 Anyone know how the shard component goes about fetching 
 start=100rows=10 from say 10 shards? Does it have to merge sorted 
 lists of 1mill+10 docsids from each shard which is the worst case?
 
 Yep, that's how it works today.
 
 
 Technically, if your docs have a non-biased (in regards to their
 sort-value) distribution across shards, you can fetch much less than
 topN docs from each shard.
 I played with the idea, and it worked for me. Though later I dropped
 the opto, as it complicated things somewhat and my users aren't
 querying gazillions of docs often.
 
 
 -- 
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Phone: +7 (495) 683-567-4
 ICQ: 104465785
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm


[ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979128#action_12979128
 ] 

Michael McCandless commented on LUCENE-2854:


The above patch applies to 3.x

For trunk I plan to remove SimliarityDelegator from core, and move it 
(deprecated) into contrib/queries/... (private to FuzzyLikeThisQ).  At some 
point [later] we can fix FuzzyLikeThisQ to not use it...

 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979129#action_12979129
]

Michael McCandless commented on LUCENE-2324:

bq. I guess we don't really need the global lock. A thread performing the
global flush could still acquire each thread state before it starts flushing,
but return a threadState to the pool once that particular threadState is done
flushing?

Good question... we could (in theory) also flush them concurrently? But, since
we don't own the threads in IW, we can't easily do that, so I think no global
lock, go through all DWPTs w/ current thread and flush, sequentially? So all
that's guaranteed after the global flush() returns is that all state present
prior to when flush() is invoked, is moved to disk. Ie if addDocs are still
happening concurrently then the DWPTs will start filling up again even while
the global flush runs. That's fine.

{quote}

A related question is: Do we want to piggyback on multiple threads when a
global flush happens? Eg. Thread 1 called commit, Thread 2 shortly afterwards
addDocument(). When should addDocument() happen?
a) After all DWPTs finished flushing?
b) After at least one DWPT finished flushing and is available again?
c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1?

a) is currently implemented, but I think not really what we want.
b) is probably best for RT, because it means the lowest indexing latency for
the new document to be added.
c) probably means the best overall throughput (depending even on hardware like
disk speed, etc)
{quote}

I think start simple -- the addDocument always happens? Ie it's never
coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since
flush is single threaded, there should always be a free DWPT?

Longer term c) would be great, or, if IW has an ES then it'd send multiple
flush jobs to the ES.

{quote}
For whatever option we pick, we'll have to carefully think about error
handling. It's quite straightforward for a) (just commit all flushed segments
to SegmentInfos when the global flush completed succesfully). But for b) and c)
it's unclear what should happen if a DWPT flush fails after some completed
already successfully before.
{quote}

I think we should continue what we do today? Ie, if it's an 'aborting'
exception, then the entire segment held by that DWPT is discarded? And we then
throw this exc back to caller (and don't try to flush any other segments)?

Per thread DocumentsWriters that write their own private segments
-

Key: LUCENE-2324
URL: https://issues.apache.org/jira/browse/LUCENE-2324
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Realtime Branch

Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch,
LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch,
lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out

See LUCENE-2293 for motivation and more details.
I'm copying here Mike's summary he posted on 2293:
Change the approach for how we buffer in RAM to a more isolated
approach, whereby IW has N fully independent RAM segments
in-process and when a doc needs to be indexed it's added to one of
them. Each segment would also write its own doc stores and
normal segment merging (not the inefficient merge we now do on
flush) would merge them. This should be a good simplification in
the chain (eg maybe we can remove the *PerThread classes). The
segments can flush independently, letting us make much better
concurrent use of IO CPU.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: LICENSE/NOTICE file contents

2011-01-08 Thread Grant Ingersoll

Because they are shipped with Solr.  I don't see why it hurts to give people 
information about what's in the download.


On Jan 8, 2011, at 8:10 AM, karl.wri...@nokia.com karl.wri...@nokia.com 
wrote:

 From svn, Yonik seems to be the go-to guy for LICENSE and NOTICE stuff.  
 Yonik, do you remember why the HSQLDB and Jetty notice text was included in 
 Solr's NOTICE.txt?  The incubator won't release ManifoldCF until we answer 
 this question. ;-)
 
 Karl
 
 
 From: ext Robert Muir [rcm...@gmail.com]
 Sent: Saturday, January 08, 2011 7:11 AM
 To: dev@lucene.apache.org
 Subject: Re: LICENSE/NOTICE file contents
 
 You are probably right... the LICENSE.txt also contains many instances
 of incorrect capitalization, I noticed that all versions of of this
 file I can find anywhere have this problem :)
 
 On Sat, Jan 8, 2011 at 6:14 AM,  karl.wri...@nokia.com wrote:
 This list might be interested to know that the current Solr LICENSE and 
 NOTICE file contents are not Apache standard.  The ManifoldCF project based 
 its LICENSE and NOTICE files on the Solr ones and got the following icy 
 reception in the incubator:
 
 
 The NOTICE file is still incorrect and includes a lot of unnecessary
 stuff. Understanding how to do releases with the correct legal files
 is one of the important parts of incubation and as this is the first
 release for the poddling i think this needs to be sorted out.
 
 For the NOTICE file, start with the following text (between the ---'s):
 
 ---
 Apache ManifestCF
 Copyright 2010 The Apache Software Foundation
 
 This product includes software developed by
 The Apache Software Foundation (http://www.apache.org/).
 ---
 
 and then add _nothing_ unless you can find explicit policy documented
 somewhere in the ASF that says it is required. If someone wants to add
 something ask for the URL where the requirement is documented. The
 NOTICE file should only include required notices, the other text thats
 in the current NOTICE file could go in a README file, see
 http://www.apache.org/legal/src-headers.html#notice
 
 For the LICENSE file, it should start with the AL as the current one
 does, and then include the text for all the other licenses used in the
 distribution. Those license that are currently in the NOTICE file
 should be moved to the LICENSE file and then you need to verify that
 all the 3rd party dependencies in the src and binary distributions are
 also in the LICENSE files of those distributions.
 
 
 
 Our NOTICE includes the following, which was taken from Solr (because we 
 have a similar dependency).  I'd like to know whether it is a valid thing to 
 include, and where it says that somewhere in Apache:
 
 
 =
 == Jetty Notice==
 =
 ==
 Jetty Web Container
 Copyright 1995-2006 Mort Bay Consulting Pty Ltd
 ==
 
 This product includes some software developed at The Apache Software
 Foundation (http://www.apache.org/).
 
 The javax.servlet package used by Jetty is copyright
 Sun Microsystems, Inc and Apache Software Foundation. It is
 distributed under the Common Development and Distribution License.
 You can obtain a copy of the license at
 https://glassfish.dev.java.net/public/CDDLv1.0.html.
 
 The UnixCrypt.java code ~Implements the one way cryptography used by
 Unix systems for simple password protection.  Copyright 1996 Aki Yoshida,
 modified April 2001  by Iris Van den Broeke, Daniel Deville.
 
 The default JSP implementation is provided by the Glassfish JSP engine
 from project Glassfish http://glassfish.dev.java.net.  Copyright 2005
 Sun Microsystems, Inc. and portions Copyright Apache Software Foundation.
 
 Some portions of the code are Copyright:
 2006 Tim Vernum
 1999 Jason Gilbert.
 
 The jboss integration module contains some LGPL code.
 
 =
 == HSQLDB Notice   ==
 =
 
 For content, code, and products originally developed by Thomas Mueller and 
 the Hypersonic SQL Group:
 
 Copyright (c) 1995-2000 by the Hypersonic SQL Group.
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
 
 Redistributions of source code must retain the above copyright notice, this
 list of conditions and the following disclaimer.
 
 Redistributions in binary form must reproduce the above copyright notice,
 this list of conditions and the following disclaimer in the documentation
 and/or other materials provided with the distribution.

Re: LICENSE/NOTICE file contents

2011-01-08 Thread Robert Muir

On Sat, Jan 8, 2011 at 10:06 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 There also wasn't any business about and then add _nothing_ unless
 you can find explicit policy documented
 somewhere in the ASF that says it is required.  I was following
 examples from other projects and any docs I could find at the time,
 but this was back in '06.


Not sure there is now either, this is likely just someone's opinion.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979138#action_12979138
]

Jason Rutherglen commented on LUCENE-2324:
--

{quote}So all that's guaranteed after the global flush() returns is that all
state present prior to when flush() is invoked, is moved to disk. Ie if addDocs
are still happening concurrently then the DWPTs will start filling up again
even while the global flush runs. That's fine.{quote}

What if the user wants a guaranteed hard flush of all state up to the point of
the flush call (won't they want this sometimes with getReader)? If we're
flushing sequentially (without pausing all threads) we're removing that? Maybe
we'll need to give the option of global lock/stop or sequential flush?

Also I think we need to clear the thread bindings of a DWPT just prior to the
flush of the DWPT? Otherwise (when multiple threads are mapped to a single
DWPT) the other threads will wait on the [main] DWPT flush when they should be
spinning up a new DWPT?

Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
recycling? Maybe we need to and share and sync the byte[] pooling between DWPTs
or will that noticeably affect indexing performance?

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

RE: LICENSE/NOTICE file contents

2011-01-08 Thread karl.wright



Nope - wasn't me that added the license stuff into NOTICE.txt ;-)
But, including Jetty's NOTICE seems appropriate for our NOTICE.  It's
just the license parts of the HSQLDB and SLF4J that should be moved to
LICENSE.txt


The NOTICE text is actually different from the LICENSE text for HSQLDB, which 
is why I thought it must have come from an HSQLDB NOTICE file.

Karl


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979139#action_12979139
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Also, don't we need the global lock for commit/close?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm


 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2854:


Attachment: LUCENE-2854_fuzzylikethis.patch

here is the patch for fuzzylikethis for trunk... so you can remove the 
delegator completely in trunk.


 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

[
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979141#action_12979141
]

Robert Muir commented on LUCENE-2854:
-

Is it possible to remove this method Query.getSimilarity also? I don't
understand why we need this method!

{noformat}
/** Expert: Returns the Similarity implementation to be used for this query.
* Subclasses may override this method to specify their own Similarity
* implementation, perhaps one that delegates through that of the Searcher.
* By default the Searcher's Similarity implementation is returned.*/
{noformat}

Deprecate SimilarityDelegator and Similarity.lengthNorm
---

Key: LUCENE-2854
URL: https://issues.apache.org/jira/browse/LUCENE-2854
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 3.1, 4.0

Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Lucene-Solr-tests-only-3.x - Build # 3511 - Failure

2011-01-08 Thread Apache Hudson Server

Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3511/

1 tests failed.
REGRESSION:  org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety

Error Message:
unable to create new native thread

Stack Trace:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:614)
at 
org.apache.lucene.search.TestThreadSafe.doTest(TestThreadSafe.java:133)
at 
org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety(TestThreadSafe.java:152)
at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:255)




Build Log (for compile errors):
[...truncated 8566 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version-5.patch

Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and an hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended adding or selecting different UIMA analysis engines, both 
 from UIMA repositories on the web or creating new ones from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979144#action_12979144
 ] 

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:


Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

  was (Author: teofili):
Changes are:
# drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core aware, each core has now an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter 
instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
  
 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and an hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended adding or selecting different UIMA analysis engines, both 
 from UIMA repositories on the web or creating new ones from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Issue Comment Edited: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979144#action_12979144
]

Tommaso Teofili edited comment on SOLR-2129 at 1/8/11 11:09 AM:

Changes are:
- Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
- Make the getAE method in OverridingParamAEProvider synchronized to support
concurrent requests to the provider.
- Make the getAEProvider method in AEProviderFactory synchronized and make the
cache core aware, each core has now an AEProvider for each analysis engine's
path.
- The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

was (Author: teofili):
Changes are:
# Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
# Make the getAE method in OverridingParamAEProvider synchronized to support
concurrent requests to the provider.
# Make the getAEProvider method in AEProviderFactory synchronized and make the
cache core aware, each core has now an AEProvider for each analysis engine's
path.
# The UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter
instead of a SolrConfig object.

I tested it with multiple cores and concurrent updates for each core.

Provide a Solr module for dynamic metadata extraction/indexing with Apache
UIMA
---

Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch,
SOLR-2129-version-5.patch, SOLR-2129-version2.patch,
SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch

Provide components to enable Apache UIMA automatic metadata extraction to be
exploited when indexing documents.
The purpose of this is to get unstructured information inside a document
and create structured metadata (as fields) to enrich each document.
Basically this can be done with a custom UpdateRequestProcessor which
triggers UIMA while indexing documents.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences
(with a tokenizer and an hidden Markov model tagger), named entities,
language, suggested category, keywords and concepts (exploiting external
services from OpenCalais and AlchemyAPI). Such an implementation can be
easily extended adding or selecting different UIMA analysis engines, both
from UIMA repositories on the web or creating new ones from scratch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979146#action_12979146
]

Michael McCandless commented on LUCENE-2324:

{quote}
What if the user wants a guaranteed hard flush of all state up to the point of
the flush call (won't they want this sometimes with getReader)? If we're
flushing sequentially (without pausing all threads) we're removing that? Maybe
we'll need to give the option of global lock/stop or sequential flush?
{quote}

What's a hard flush?

With the proposed approach, all docs added (or in the process of being added)
will make it into the flushed segments once the flush returns; newly added docs
after the flush call started may or not make it. But this is fine? I mean, if
the app has stronger requirements then it should externally sync?

bq. Also I think we need to clear the thread bindings of a DWPT just prior to
the flush of the DWPT?

Right.

As soon as a DWPT is pulled from production for flushing, it loses all thread
affinity and becomes unavailable until its flush finishes. When a thread needs
a DWPT, it tries to pick the one it last had (affinity) but if that one's busy,
it picks a new one. If none are available but we are below our max DWPT count,
it spins up a new one?

{quote}
Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
recycling?
{quote}

Why would we lose them? Wouldn't that DWPT just go back into rotation once the
flush is done?

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979149#action_12979149
]

Jason Rutherglen commented on LUCENE-2324:
--

{quote}As soon as a DWPT is pulled from production for flushing, it loses all
thread affinity and becomes unavailable until its flush finishes. When a thread
needs a DWPT, it tries to pick the one it last had (affinity) but if that one's
busy, it picks a new one. If none are available but we are below our max DWPT
count, it spins up a new one?{quote}

Right.

{quote}With the proposed approach, all docs added (or in the process of being
added) will make it into the flushed segments once the flush returns; newly
added docs after the flush call started may or not make it. But this is fine? I
mean, if the app has stronger requirements then it should externally
sync?{quote}

Ok. The proposed change is simply the thread calling add doc will flush it's
DWPT if needed, take it offline while doing so, and return it when completed.
I think the risk is a new DWPT likely will have been created during flush,
which'd make the returning DWPT inutile?

{quote}Why would we lose them? Wouldn't that DWPT just go back into rotation
once the flush is done?{quote}

Yes, we just need to change the existing code a bit then.

However I think we may still need the global lock for close, eg, today we're
preventing the user from adding docs during close, after this issue is merged
that behavior would change?

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-1260) Norm codec strategy in Similarity


 [ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1260:


Attachment: LUCENE-1260_defaultsim.patch

Here's a patch for the general case, and it also adds a warning
that you should set your similarity with Similarity.setDefault, especially if 
you omit norms.

We can backport this to 3.x

The other cases involve fake norms, which I think we should completely remove 
in trunk
with LUCENE-2846, then there is no longer an issue and we can remove the 
warning in trunk.


 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2011-01-08 Thread Uwe Schindler (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979160#action_12979160
 ] 

Uwe Schindler commented on LUCENE-1260:
---

bq. Here's a patch for the general case, and it also adds a warning that you 
should set your similarity with Similarity.setDefault, especially if you omit 
norms. 

Is there no way to remove this stupid static default and deprecate 
Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for 
the case of NormsWriter?

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979162#action_12979162
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

And there's the case of the thread calling flush doesn't yet have a DWPT, it's 
going to need to get one assigned to it, however the one assigned may not be 
the max ram consumer.  What'll we do then?  If the user explicitly called flush 
we can a) do nothing b) flush (the max ram consumer) thread's DWPT, however 
that gets hairy with wait notifies (almost like the global lock?).

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity


[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979164#action_12979164
 ] 

Robert Muir commented on LUCENE-1260:
-

bq. Is there no way to remove this stupid static default and deprecate 
Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for 
the case of NormsWriter?

I think this is totally what we should try to do in trunk, especially after 
LUCENE-2846.

In this case, i want to fix the issue in a backwards-compatible way for lucene 
3.x
The warning is a little crazy I know, really people shouldnt rely upon their 
encoder being used for *fake norms*.
But i think its fair to document the corner case, just because its not really 
fixable easily in 3.x

For trunk, here is what i suggest:
* LUCENE-2846: remove all uses of fake norms. We never fill fake norms anymore 
at all, once we fix this issue. If you have a non-atomic reader with two 
segments, and one has no norms, then the whole norms[] should be null. this is 
consistent with omitTF. So, for example MultiNorms would never create fake
norms.
* LUCENE-2854: Mike is working on some issues i think where BooleanQuery uses 
this static or some other silliness with Similarity, i think we can clean that 
up there.
* finally at this point, I would like to remove 
Similarity.getDefault/setDefault alltogether. I would prefer instead that 
IndexSearcher has a single 'DefaultSimilarity' that is the default value if you 
don't provide one, and likewise with IndexWriterConfig.


 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: Lucene-1260-1.patch, Lucene-1260-2.patch, 
 Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt, 
 LUCENE-1260_defaultsim.patch


 The static span and resolution of the 8 bit norms codec might not fit with 
 all applications. 
 My use case requires that 100f-250f is discretized in 60 bags instead of the 
 default.. 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm

[
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979174#action_12979174
]

Michael McCandless commented on LUCENE-2854:

bq. Is it possible to remove this method Query.getSimilarity also? I don't
understand why we need this method!

I would love to! But I think that's for another day...

I looked into this and got stuck with BoostingQuery, which rewrites to an anon
subclass of BQ overriding its getSimilarity in turn override its coord method.
Rather twisted... if we can do this differently I think we could remove
Query.getSimilarity.

Deprecate SimilarityDelegator and Similarity.lengthNorm
---

Key: LUCENE-2854
URL: https://issues.apache.org/jira/browse/LUCENE-2854
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 3.1, 4.0

Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm


 [ 
https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2828:
---

Fix Version/s: 3.0.4
   2.9.5

 SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
 --

 Key: LUCENE-2828
 URL: https://issues.apache.org/jira/browse/LUCENE-2828
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3
Reporter: Michael McCandless
 Fix For: 2.9.5, 3.0.4

 Attachments: LUCENE-2828.patch


 In LUCENE-1420, we added Similarity.computeNorm to let the norm computation 
 have access to the raw information (length, boost, etc.).
 But this class broke back compat with SimilarityDelegator.  We did add 
 computeNorm there, but, it's impl just forwards to the delegee's computeNorm. 
  In the case where a subclass of SimilarityDelegator overrides lengthNorm, 
 that method will no longer be invoked.
 Not quite sure how to fix this since, somehow, we have to determine whether 
 the delegee's impl of computeNorm should be favored over the subclasses impl 
 of the legacy lengthNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2828) SimilarityDelegator broke back-compat for subclasses overriding lengthNorm


[ 
https://issues.apache.org/jira/browse/LUCENE-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979178#action_12979178
 ] 

Michael McCandless commented on LUCENE-2828:


We won't fix this for 3.x or 4.0, since we've deprecated SimilarityDelegator, 
and forced hard cutover from Sim.lengthNorm - Sim.computeNorm (LUCENE-2854).

But I'll leave this open in case we do another 2.9/3.0 release.

 SimilarityDelegator broke back-compat for subclasses overriding lengthNorm
 --

 Key: LUCENE-2828
 URL: https://issues.apache.org/jira/browse/LUCENE-2828
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3
Reporter: Michael McCandless
 Fix For: 2.9.5, 3.0.4

 Attachments: LUCENE-2828.patch


 In LUCENE-1420, we added Similarity.computeNorm to let the norm computation 
 have access to the raw information (length, boost, etc.).
 But this class broke back compat with SimilarityDelegator.  We did add 
 computeNorm there, but, it's impl just forwards to the delegee's computeNorm. 
  In the case where a subclass of SimilarityDelegator overrides lengthNorm, 
 that method will no longer be invoked.
 Not quite sure how to fix this since, somehow, we have to determine whether 
 the delegee's impl of computeNorm should be favored over the subclasses impl 
 of the legacy lengthNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Resolved: (LUCENE-2854) Deprecate SimilarityDelegator and Similarity.lengthNorm


 [ 
https://issues.apache.org/jira/browse/LUCENE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2854.


Resolution: Fixed

 Deprecate SimilarityDelegator and Similarity.lengthNorm
 ---

 Key: LUCENE-2854
 URL: https://issues.apache.org/jira/browse/LUCENE-2854
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2854.patch, LUCENE-2854_fuzzylikethis.patch


 SimilarityDelegator is a back compat trap (see LUCENE-2828).
 Apps should just [statically] subclass Sim or DefaultSim; if they really need 
 runtime subclassing then they can make their own app-level delegator.
 Also, Sim.computeNorm subsumes lengthNorm, so we should deprecate lengthNorm 
 in favor of computeNorm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: svn commit: r1056612 - in /lucene/dev/trunk/solr/src/java/org/apache/solr: handler/ handler/component/ request/ search/

2011-01-08 Thread Chris Hostetter


:  +  public static final SetString EMPTY_STRING_SET = 
Collections.emptySet();
:  +
: 
: I don't know about this commit... i see a lot of EMPTY set's and maps
: defined statically here.
...
: I think we should be using the Collection methods, for example on your
: first file:

Hmmm... i am using the Collections method, it's the same set/map in each 
case, i'm just creating static ref's to them with the type information.  

My reading of the javadocs was that the implementation of emptySet() was 
going to just return the same immutable instance every time anyway, so 
there didn't seem to be any functional diff in reusing it like this -- it 
seemed like the natureal way to migrate from using Collections.EMPTY_SET,  
use our own local ref of the same object w/type info.

: -  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
: +  this(fieldName, fieldType, analyzer, Collections.StringemptySet());

Ah... see, i didn't even know that syntax was valid to bind the generic on 
a static method.  I'd only ever done the binding in the assignmet.  

yeah, sure -- i'll make a note to myself to go back and clean those up.

-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (SOLR-2288) clean up compiler warnings

2011-01-08 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979188#action_12979188
 ] 

Hoss Man commented on SOLR-2288:


Reminder to self: feedback from rmuir on the mailing list to replace the static 
EMPTY set/map refs w/type info that i added with direct usage like this...

-  this(fieldName, fieldType, analyzer, EMPTY_STRING_SET);
+  this(fieldName, fieldType, analyzer, Collections.StringemptySet());


 clean up compiler warnings
 --

 Key: SOLR-2288
 URL: https://issues.apache.org/jira/browse/SOLR-2288
 Project: Solr
  Issue Type: Improvement
Reporter: Hoss Man
Assignee: Hoss Man
 Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch


 there's a ton of compiler warning in the solr tree, and it's high time we 
 cleaned them up, or annotate them to be suppressed so we can start making a 
 bigger stink when/if code is added to the tree thta produces warnings (we'll 
 never do a good job of noticing new warnings when we have ~175 existing ones)
 Using this issue to track related commits
 The goal of this issue should not be to change any functionality or APIs, 
 just deal with each warning in the most appropriate way;
 * fix generic declarations
 * add SuppressWarning anotation if it's safe in context

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979190#action_12979190
]

Michael McCandless commented on LUCENE-2324:

{quote}
And there's the case of the thread calling flush doesn't yet have a DWPT, it's
going to need to get one assigned to it, however the one assigned may not be
the max ram consumer. What'll we do then? If the user explicitly called flush
we can a) do nothing b) flush (the max ram consumer) thread's DWPT, however
that gets hairy with wait notifies (almost like the global lock?).
{quote}

Wait -- why would the thread calling flush need to have a DWPT assigned to it?
You're talking about the flush the world case? (Ie the app calls IW.commit
or IW.getReader). In this case the thread just one by one pulls all DWPTs that
have any indexed docs out of production, flushes them, clears them, and returns
them to production?

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979189#action_12979189
]

Michael McCandless commented on LUCENE-2324:

bq. The proposed change is simply the thread calling add doc will flush it's
DWPT if needed, take it offline while doing so, and return it when completed.

Wait -- this is the addDocument case right? (I thought we were still talking
about the flush the world case...).

bq. I think the risk is a new DWPT likely will have been created during flush,
which'd make the returning DWPT inutile?

A new DWPT will have been created only if more than one thread is indexing docs
right? In which case this is fine? Ie the old DWPT (just flushed) will just
go back into rotation, and when another thread comes in it can take it?

But, you're right: maybe we should sometimes prune DWPTs. Or simply stop
recycling any RAM, so that a just-flushed DWPT is an empty shell.

bq. However I think we may still need the global lock for close, eg, today
we're preventing the user from adding docs during close, after this issue is
merged that behavior would change?

Well, the threads still adding docs will hit AlreadyClosedException? (But,
that's just best effort). The behavior of calling IW.close while other
threads are still adding docs has never been defined (and, shouldn't be) except
that we won't corrupt your index, and we'll get all docs indexed before .close
was called, committed. So I think even for this case we don't need a global
lock.

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-2829) improve termquery pk lookup performance


 [ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2829:
---

Attachment: LUCENE-2829.patch

New patch.  I added VirtualMethods to Sim to make sure Sim subclasses that 
don't override idfExplain that takes docFreq are still called.

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2829.patch, LUCENE-2829.patch, LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned we could every backport that to 3.1 for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (SOLR-236) Field collapsing

2011-01-08 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979193#action_12979193
 ] 

Samuel García Martínez commented on SOLR-236:
-

The NPE noticed by Shekhar Nirkhe is caused by some errors on filter query 
cache and the signature key that is using to store cached results. 

To sum up, if you perform a filter query and then, you perform that query using 
collapse field, that query result is already cached, but not cached as expected 
by this component. Resulting that the DocSet implementation is not the expected 
one, and, as cached result, the DocumentCollector is not executed at any time.

As soon as i can ill post a patch using combined key to cache results, formed 
by the collector class and the query itself.

Colbenson - Findability Experts 
http://www.colbenson.es/



 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: Next

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
 field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
 quasidistributed.additional.patch, 
 SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, 
 SOLR-236-distinctFacet.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
 SOLR-236_collapsing.patch


 This patch include a new feature called Field collapsing.
 Used in order to collapse a group of results with similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site is collapsed into one or two 
 entries in the result set, typically with an associated more documents from 
 this site link. See also Duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48amid=299
 The implementation add 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979229#action_12979229
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}the flush the world case? (Ie the app calls IW.commit or
IW.getReader). In this case the thread just one by one pulls all DWPTs that
have any indexed docs out of production, flushes them, clears them, and returns
them to production?{quote}

The 2 cases are: A) Flush every DWPT sequentually (aka flush the world) and 
B) flush by RAM usage when adding docs or deleting. A is clear! I think with B
we're saying even if the calling thread is bound to DWPT #1, if DWPT #2 is
greater in size and the aggregate RAM usage exceeds the max, using the calling
thread, we take DWPT #2 out of production, flush, and return it?

{quote}The behavior of calling IW.close while other threads are still adding
docs has never been defined (and, shouldn't be) except that we won't corrupt
your index, and we'll get all docs indexed before .close was called, committed.
So I think even for this case we don't need a global lock.{quote}

Great, that simplifies and clarifies that we do not require a global lock.

{quote}But, you're right: maybe we should sometimes prune DWPTs. Or simply
stop recycling any RAM, so that a just-flushed DWPT is an empty shell.{quote}

I'm not sure how we'd prune, typically object pools have a separate eviction
thread, I think that's going overboard? Maybe we can simply throw out the DWPT
and put recycling byte[]s and/or pooling DWPTs back in later if it's necessary?



 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979243#action_12979243
]

Jason Rutherglen commented on LUCENE-2324:
--

To further clarify, we also no longer have global aborts? Each abort only
applies to an individual DWPT?

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979247#action_12979247
]

Michael Busch commented on LUCENE-2324:
---

bq. I think the risk is a new DWPT likely will have been created during flush,
which'd make the returning DWPT inutile.

The DWPT will not be removed from the pool, just marked as busy during flush,
like as its state is busy (or currently called non-idle in the code) during
addDocumentI(). So no new DWPT would be created during flush if the
maxThreadState limit was already reached.

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-08 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979248#action_12979248
]

Michael Busch commented on LUCENE-2324:
---

{quote}
I think start simple - the addDocument always happens? Ie it's never
coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since
flush is single threaded, there should always be a free DWPT?
{quote}

Yeah I agree. The change I'll make then is to not have the global lock and
return a DWPT immediately to the pool and set it to 'idle' after its flush
completed.

{quote}
I think we should continue what we do today? Ie, if it's an 'aborting'
exception, then the entire segment held by that DWPT is discarded? And we then
throw this exc back to caller (and don't try to flush any other segments)?
{quote}

What I meant was the following situation: Suppose we have two DWPTs and
IW.commit() is called. The first DWPT finishes flushing successfully, is
returned to the pool and idle again. The second DWPT flush fails with an
aborting exception. Should the segment of the first DWPT make it into the
index or not? I think segment 1 shouldn't be committed, ie. a global flush
should be all or nothing. This means we would have to delay the commit of the
segments until all DWPTs flushed successfully.

Per thread DocumentsWriters that write their own private segments
-

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-2855) Contrib queryparser should not use CharSequence as Map key

2011-01-08 Thread Adriano Crestani (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adriano Crestani updated LUCENE-2855:
-

Attachment: lucene_2855_adriano_crestani_2011_01_08.patch

Here is the fix for the problem raised at thread [1]. The patch also includes a 
junit to make sure the problem doesn't show up again.

If there are no concerns in two days, I will go ahead and commit the patch.

[1] - http://lucene.markmail.org/thread/mbb5wlxttsa6sges

 Contrib queryparser should not use CharSequence as Map key
 --

 Key: LUCENE-2855
 URL: https://issues.apache.org/jira/browse/LUCENE-2855
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0.3
Reporter: Adriano Crestani
Assignee: Adriano Crestani
 Fix For: 3.0.4

 Attachments: lucene_2855_adriano_crestani_2011_01_08.patch


 Today, contrib query parser uses MapCharSequence,... in many different 
 places, which may lead to problems, since CharSequence interface does not 
 enforce the implementation of hashcode and equals methods. Today, it's 
 causing a problem with QueryTreeBuilder.setBuilder(CharSequence,QueryBuilder) 
 method, that does not works as expected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments