RE: [Possibly spoofed] Re: Lucene/Solr 6.0.0 Release Branch

2016-03-08 Thread Vanlerberghe, Luc
Hi,

I added two JIRA issues (Lucene: 
https://issues.apache.org/jira/browse/LUCENE-7078, Solr: 
https://issues.apache.org/jira/browse/SOLR-8802 ) concerning Query classes that 
are still mutable and should either become immutable, marked as 
@lucene.experimental or get a comment why it’s not an issue for that case.

Since they are part of the public API, I think now is the time to update them.

I already converted MultiPhraseQuery 
(https://issues.apache.org/jira/browse/LUCENE-7064: reviewed and committed by 
Adrien Grand).
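
For reference, typical usage of the converted class then looks roughly like this (a sketch assuming the MultiPhraseQuery.Builder API from that patch; the field and terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiPhraseQuery;

    public class MultiPhraseExample {
        public static MultiPhraseQuery build() {
            MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
            // one term at the first position
            builder.add(new Term("body", "quick"));
            // two alternative terms at the next position
            builder.add(new Term[] { new Term("body", "fox"), new Term("body", "foxes") });
            return builder.build(); // the built query itself is immutable
        }
    }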

Luc Vanlerberghe

From: Joel Bernstein [mailto:joels...@gmail.com]
Sent: Monday, March 7, 2016 21:08
To: lucene dev
Subject: [Possibly spoofed] Re: Lucene/Solr 6.0.0 Release Branch

"Major API and bug fixes (no features) can be committed without my approval 
before Friday as long as they're reviewed and approved by another committer."

Hmmm, are there really major API changes underway at this point? As for bug 
fixes, needing approval from another committer is not something we've done in the 
past.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 7, 2016 at 2:54 PM, Nicholas Knize wrote:
I think with all of the volatility surrounding the new Points codec that 
obvious bug/stability patches like these are OK? I know several folks have been 
working feverishly the past few days to fix serious (and simplify) 6.0 issues 
and squash all of the jenkins failures to ensure stability in time for the 
major release. That being said, you're right that we don't want chaotic 
committing as we lead up to the release.

So unless there are objections I'll plan to move forward and start the 
release process this Friday. Until then, since this is a major release, the more 
people we can get to scrutinize and stabilize 6_0 over the next 3-4 days the 
better. Major API and bug fixes (no features) can be committed without my 
approval before Friday as long as they're reviewed and approved by another 
committer. If there is any uncertainty ping me on this thread or the 
corresponding issue and I'll review. I will also send out an email 24 hours 
before I start the release process.

- Nick


On Mon, Mar 7, 2016 at 9:04 AM, david.w.smi...@gmail.com wrote:
I just want to clarify your (Nick's) / our expectations about this release branch.  
It seems, based on issues I've seen like LUCENE-7072, that we can commit to the 
release branch without your permission as RM.  If this is true, then I presume 
the tacit approval is okay so long as it's not a new feature.  Right?

On Wed, Feb 24, 2016 at 3:23 PM Nicholas Knize wrote:
With the release of 5.5 and the previous discussion re: 6.0.0 I'd like to keep 
the ball moving and volunteer as the 6.0.0 RM.

If there are no objections my plan is to cut branch_6_0 early next week - Mon 
or Tues. Please mark blocker issues accordingly and/or let me know if there are 
any commits needed before cutting the branch.

- Nick
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: 
http://www.solrenterprisesearchserver.com




Re: Lucene/Solr 6.0.0 Release Branch

2016-03-04 Thread Vanlerberghe, Luc
Hi,

With the recent switch to git and the debate on branching for 6.x going on, may 
I suggest taking a look at the git workflow of the apache Cassandra project?
http://wiki.apache.org/cassandra/HowToContribute#Committing
(There's a link to a Google TechTalk video with an in-depth explanation too)

They currently have 4 (4!) supported releases, each in their own branch, and 
still keep everything manageable.

Basically, they apply changes at the lowest release for which the change is 
applicable and then merge the change upwards to the other releases and 
eventually to trunk.

IMHO it has the following advantages:
- A fix for an issue (even if it would be a long-running side-branch) occurs 
only once in the full project log.
- No more cherry-picking
- Instead of having parallel logs per version, it's immediately clear which 
fixes are applied to which version
- No fixes for a lower version can be forgotten in a higher version (the 
committer of the next fix would notice)
- Back-porting is still possible if really necessary by cherry-picking. The 
merge-upward would be a no-op, but still make clear the versions are in sync.
- If you have a graphical tool to visualize the log (even if only using ascii 
art), it looks beautiful :)

If you want to take a look, their git repository is at: 
http://git-wip-us.apache.org/repos/asf/cassandra.git

Luc


-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Thursday, March 3, 2016 17:30
To: dev@lucene.apache.org
Subject: [Possibly spoofed] Re: Lucene/Solr 6.0.0 Release Branch

Mike,

I'll fix the TestBackwardsCompatibility. The mistake was to freeze the
6x branch in the first place. The release branch is the one which
should be frozen. I specifically asked the RM to cut the branch to let
others progress but I received no replies -- which is why I was forced
to do it myself. In future, the RM should keep this in mind and not
block others. The rest of the problem was because I am new to Git --
in subversion a release branch is always copied from the server so
pulling latest changes locally before creating the branch did not
cross my mind.

On Thu, Mar 3, 2016 at 9:46 PM, Michael McCandless wrote:
> Shalin,
>
> In the future please don't jump the gun like this?
>
> It has caused a lot of unnecessary chaos.  It should be the RM, and
> only the RM, that is doing things like creating release branches,
> bumping versions, etc., at release time.
>
> Also, your changes to bump the version on 6.x seem to be causing
> TestBackwardsCompatibility to be failing.  Can you please fix that?
> In the future, when you are RM, please run tests when bumping versions
> before pushing.
>
> A major release is hard enough with only one chef.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Mar 3, 2016 at 8:52 AM, Shalin Shekhar Mangar wrote:
>> Hmm I think I created the branch without pulling the latest code. I'll fix.
>>
>> On Thu, Mar 3, 2016 at 6:41 PM, Robert Muir  wrote:
>>> This is missing a bunch of yesterday's branch_6x changes. Some of
>>> david smiley's spatial work, at least one of my commits.
>>>
>>> On Thu, Mar 3, 2016 at 5:10 AM, Shalin Shekhar Mangar wrote:
 FYI, I have created the branch_6_0 so that we can continue to commit
 stuff intended for 6.1 on master and branch_6x. I have also added the
 6.1.0 version on branch_6x and master.

 On Wed, Mar 2, 2016 at 9:51 PM, Shawn Heisey  wrote:
> On 3/2/2016 4:19 AM, Alan Woodward wrote:
>> Should we create a separate branch_6_0 branch for the feature-freeze?
>>  I have stuff to push into master and that should eventually make it
>> into 6.1, and it will be easy to forget to backport stuff if there's a
>> week before I can do that…
>
> +1
>
> When I saw Nick's email about branch_6x being feature frozen, my first
> thought was that we don't (and really can't) feature freeze the stable
> branch -- isn't new feature development (for the next minor release in
> the current major version) the entire purpose of branch_Nx?
>
> A feature freeze on a specific minor version does make sense.  I've seen
> a couple of people say that we have, but there are also a few messages
> from people saying that they want to include new functionality in 6.0.
> I expect that backporting almost anything from branch_6x to branch_6_0
> will be relatively easy, so it may be a good idea to just create the new
> branch.
>
> Thanks,
> Shawn
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>



 --
 Regards,
 Shalin Shekhar Mangar.

 

RE: [Possibly spoofed] Re: Solr/Lucene 6.x: Multiple public Query classes not immutable (yet)

2016-03-03 Thread Vanlerberghe, Luc
Thanks for reviewing,

There are a few things I’d like to change:

-  I didn’t leave any public MultiPhraseQuery constructors like you did 
for PhraseQuery.  Adding a few afterwards shouldn’t break anything though.

-  The private termArrays and positions members could become fixed 
arrays like you did for PhraseQuery.  This would change the signature of 
getTermArrays() and getPositions(), so perhaps it should happen now…

From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Thursday, March 3, 2016 14:53
To: dev@lucene.apache.org
Subject: [Possibly spoofed] Re: Solr/Lucene 6.x: Multiple public Query classes 
not immutable (yet)

Hey Luc,


On Thu, Mar 3, 2016 at 11:23, Vanlerberghe, Luc <luc.vanlerber...@bvdinfo.com> wrote:
Since it is part of the public API I would suggest splitting it in an immutable 
class and a builder like was done for most other Queries *before* releasing an 
official 6.x version.

+1
I started reviewing your patch on LUCENE-7064 and it looks good. I would lean 
towards having it in 6.0 since there are a couple other APIs that are still 
getting some last-minute adjustments but I will check with Nick who will be the 
release manager for 6.0.


Solr/Lucene 6.x: Multiple public Query classes not immutable (yet)

2016-03-03 Thread Vanlerberghe, Luc
Hi,

While checking how to migrate my custom components from lucene/solr 5.1 to 6.x 
I stumbled upon the fact that oal.search.MultiPhraseQuery is not immutable like 
most other implementations (see e.g.: 
https://issues.apache.org/jira/browse/LUCENE-6531)

Since it is part of the public API I would suggest splitting it in an immutable 
class and a builder like was done for most other Queries *before* releasing an 
official 6.x version.

I did a quick scan through all derived classes of Query and I compiled the 
following list (ignoring sources in test or contrib folders)
Some of them are already marked as experimental (but should perhaps receive the 
"official" @lucene.experimental tag ?)
For some it's possibly not an issue since they should never end up in a filter 
cache (like MoreLikeThisQuery ?), but then a comment specifying the exception 
to the rule should perhaps be added.

I'll probably have a go at the MultiPhraseQuery case shortly and create 
a JIRA issue with a PR for it.

Luc Vanlerberghe

lucene/search:
- org.apache.lucene.search.MultiPhraseQuery

lucene/queries:
- org.apache.lucene.queries.CommonTermsQuery
- org.apache.lucene.queries.CustomScoreQuery (marked as @lucene.experimental)
- org.apache.lucene.queries.mlt.MoreLikeThisQuery

lucene/suggest:
- org.apache.lucene.search.suggest.document.ContextQuery (marked as 
@lucene.experimental)

lucene/facet:
- org.apache.lucene.facet.DrillDownQuery (marked as @lucene.experimental)

solr/core:
- org.apache.solr.search.ExtendedQueryBase
  Several derived classes, among which:
  - org.apache.solr.query.FilterQuery
  - org.apache.solr.query.SolrRangeQuery (marked as @lucene.experimental)
  - org.apache.solr.search.RankQuery (marked in comment as experimental, but 
not its derived classes)
  - org.apache.solr.search.WrappedQuery
- org.apache.solr.search.join.GraphQuery (marked as @lucene.experimental)
- org.apache.solr.search.SolrConstantScoreQuery (marked in comment as 
experimental, but not the derived FunctionRangeQuery)



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Fixed: Bug in solr.core TrieField (testcase provided as well)

2015-10-13 Thread Vanlerberghe, Luc
I came across the following bug in the Solr atomic update code a few weeks ago:
As soon as a document has a date in a multivalued tdate field, it is impossible 
to do atomic updates on any of the other fields.
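
A minimal sketch of the failing scenario (assuming the SolrJ 5.x client; the core URL and field names here are illustrative, not taken from the report):

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdateRepro {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

            // index a document with values in a multivalued tdate field
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("event_dates", "2015-01-01T00:00:00Z");
            doc.addField("event_dates", "2015-02-01T00:00:00Z");
            client.add(doc);
            client.commit();

            // atomic "set" on a *different* field; with the bug present, this request fails
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", "1");
            update.addField("title", Collections.singletonMap("set", "new title"));
            client.add(update);
            client.commit();
            client.close();
        }
    }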

I found a bug report in jira describing the exact same issue, so instead of 
creating a new one I attached a testcase demonstrating the bug, a workaround 
and later the fix for the bug to the existing one.

Unfortunately, the original report mistakenly classifies it as a Java client 
issue rather than a Solr core one, which is probably why no committer ever 
acknowledged the problem.

I'm talking about https://issues.apache.org/jira/browse/SOLR-8050: "Partial 
update on document with multivalued date field fails"
The pull request in there contains three commits (for trunk):

-  A testcase demonstrating the bug

-  A testcase demonstrating a workaround (providing the tdate values 
again so solr doesn't have to try to reconstruct the existing ones)

-  A fix

I ran all testcases after applying the fix and they all passed (except for an 
unrelated file access issue, probably because I'm using Windows)

I've been running the fixed version for solr-5.1.0 in a test environment for a 
while now and I'm pretty confident it's ok.

Could anyone with sufficient karma please take a look?

Thanks,

Luc



An Exception in a non-query component should not cause a global HTTP 500 Server Error response

2015-03-18 Thread Vanlerberghe, Luc
Hi,

I ran into an issue where certain combinations of queries and sort orders cause 
an Exception in the Highlighter component that in turn causes a 500 Server 
Error.
I tracked down the cause to a problem in the tokenizer chain I use, but I do 
not have a quick solution yet.

The point I want to raise here is that the global 500 error for the whole of 
the response seems way too restrictive.
In addition the response body still contains the correct values for the query 
and facets itself but getting to it is awkward (I use SolrNet: I have to catch 
the WebException, get the response from there and hope the main parts of it are 
intact...), but at least it keeps the bulk of the application running.

My suggestion would be that exceptions thrown in components only affect the 
corresponding part of the response, but do not cause a 500 Server Error.
At the end of the response, the <lst name="error"> element could be made 
repeatable (with a component="..." attribute?)

This way end users of the application might notice parts are not functioning 
properly in some circumstances, but the main application would still be usable.
The thrown exception should still be logged of course to enable the cause to be 
found (and a monitoring service to notify the admins and/or developers)

I made an exception for the query component since probably most components rely 
on it functioning correctly.

Luc Vanlerberghe



RE: [Possibly spoofed] Re: An Exception in a non-query component should not cause a global HTTP 500 Server Error response

2015-03-18 Thread Vanlerberghe, Luc
 For the most part, committers on this project believe in the "fail
 early, fail hard" philosophy.
Actually, I'm inclined that way as well and I wouldn't mind it happening on the 
indexing side: if garbage is fed into Solr or the tokenizer chain has errors: 
crash and burn, no problem :)

But here the response of Solr is a bit of both worlds:
- it sends a 500 response header to indicate the failure
- and it includes the response of the query and facet modules anyway in the 
response body
So now Solr sends multiple KB of data (in my case) just to indicate it failed 
somewhere while the response actually contains quite useful data...

 If you were using SolrJ [...], you probably would not even *get* the response 
 [...].
In SolrNet you don't simply get the response either, but the exception thrown 
contains it so that's what I used in my workaround for now.
I assume something similar is possible in SolrJ.

 If the behavior you want is *configurable*[...]
That's ok for me.  Perhaps even add the possibility not to send any response 
body (or just the exception details) other than the 500 status header.

I'll open a Jira issue shortly, but I wanted to mail the dev list first to make 
sure the idea hadn't popped up before (and been rejected or is already in progress).

Luc


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, March 18, 2015 14:42
To: dev@lucene.apache.org
Subject: [Possibly spoofed] Re: An Exception in a non-query component should 
not cause a global HTTP 500 Server Error response

On 3/18/2015 4:56 AM, Vanlerberghe, Luc wrote:
 I ran into an issue where certain combinations of queries and sort
 orders cause an Exception in the Highlighter component that in turn
 causes a 500 Server Error.

 I tracked down the cause to a problem in the tokenizer chain I use,
 but I do not have a quick solution yet.

  

 The point I want to raise here is that the “global” 500 error for the
 whole of the response seems way too restrictive.

 In addition the response body still contains the correct values for
 the query and facets itself but getting to it is awkward (I use
 SolrNet: I have to catch the WebException, get the response from there
 and hope the main parts of it are intact…), but at least it keeps the
 bulk of the application running.


There are arguments both ways here.

For the most part, committers on this project believe in the "fail
early, fail hard" philosophy.  This idea is that a failure condition
should be detected as early as possible, and that it should result in a
complete failure.  Part of this philosophy comes from the way that Java
handles exceptions - when a program has caught an exception, Java
convention says that the code statement which threw the exception has
NOT done the work it was asked to do.  If you were using SolrJ (the Java
client for Solr), you probably would not even *get* the response from
the server if the client returned a 500 HTTP response because the server
had an exception.

What you are asking for here is the ability to say "Even though part of
it failed, please give me the parts that didn't fail" ... which goes
against this entire philosophy.  When considering default behavior, I
believe that the way things currently operate is correct.

If the behavior you want is *configurable* but does not happen by
default, that's a different story.  We have similar switches already to
allow other partial failures, like shards.tolerant.  When
shards.tolerant=true, a distributed request will succeed as long as at
least one of its shard subrequests succeed, even if some of them fail. 
In that situation, the HTTP response code would be a normal 200, not 4xx
or 5xx.
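
For illustration, a sketch (assuming SolrJ; the URL and collection name are made up) of opting into that partial-failure behavior on a distributed request:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.ShardParams;

    public class TolerantQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            // succeed as long as at least one shard subrequest succeeds
            q.set(ShardParams.SHARDS_TOLERANT, true);
            QueryResponse rsp = client.query(q);
            System.out.println("numFound: " + rsp.getResults().getNumFound());
            client.close();
        }
    }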

I would recommend opening an issue in the Jira issue tracker to ask for
a query parameter that enables the behavior you're looking for.  If you
can create a patch to implement the behavior, that's even better.

https://issues.apache.org/jira/browse/SOLR

http://grep.codeconsult.ch/2007/04/28/yoniks-law-of-half-baked-patches/

Thanks,
Shawn


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [Possibly spoofed] Re: Anybody having troubles building trunk?

2015-01-08 Thread Vanlerberghe, Luc
I had exactly the same issue building Solr, but using the tips here I managed 
to get everything working again:

I deleted the .ant and .ivy2 folders in my user directory, edited 
lucene\ivy-settings.xml to comment out the <ibiblio name="cloudera" ... /> and 
<resolver ref="cloudera"/> elements (leave the elements for 
releases.cloudera.com!)

After that "ant ivy-bootstrap" and "ant resolve" ran successfully (taking about 
5 minutes to download all dependencies).

I guess that one of the artifacts loaded from cloudera conflicts with one of 
the official ones from releases.cloudera.com (perhaps the order in the 
resolver chain should be reversed?)

Side note: For releases.cloudera.com, Mark Miller changed https to http on 
14/3/2014 to work around an expired SSL certificate.
I checked the certificate on the site and switched back to using https and it 
seems to be fine now...

Regards,

Luc

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, January 8, 2015 6:32
To: dev@lucene.apache.org
Subject: [Possibly spoofed] Re: Anybody having troubles building trunk?

Similar but different? I got rid of cloudera references altogether,
did "ant clean" and it is still the same error.

The build line that failed is:

<ivy:retrieve conf="compile,compile.hadoop" type="jar,bundle"
sync="${ivy.sync}" log="download-only" symlink="${ivy.symlink}"/>

in trunk/solr/core/build.xml:65

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 8 January 2015 at 00:12, Steve Rowe sar...@gmail.com wrote:
 I had the same issue earlier today, and identified the problem here, along
 with a workaround:
 https://issues.apache.org/jira/browse/SOLR-4839?focusedCommentId=14268311page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14268311

 On Wed, Jan 7, 2015 at 10:36 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 I am having dependencies issues even if I blow away everything, check
 it out again and do 'ant resolve':
 resolve:
 [ivy:retrieve]
 [ivy:retrieve] :: problems summary ::
 [ivy:retrieve]  WARNINGS
 [ivy:retrieve] ::
 [ivy:retrieve] ::  UNRESOLVED DEPENDENCIES ::
 [ivy:retrieve] ::
 [ivy:retrieve] ::
 org.restlet.jee#org.restlet.ext.servlet;2.3.0: configuration not found
 in org.restlet.jee#org.restlet.ext.servlet;2.3.0: 'master'. It was
 required from org.apache.solr#core;working@Alexs-MacBook-Pro.local
 compile
 [ivy:retrieve] ::
 [ivy:retrieve]
 [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 BUILD FAILED

 Regards,
Alex.

 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



GitHub pull requests vs. Jira issues

2014-12-15 Thread Vanlerberghe, Luc
Hi.

I recently created two pull requests via GitHub that arrived on the dev list 
automatically.
(They may have ended up in spam since I hadn't configured my name and email 
yet, so the From: field was set to LucVL g...@git.apache.org)
I repeated the contents below just in case.

Do I need to set up corresponding JIRA issues to make sure they don't get lost 
(or at least to know if they are rejected...) or are GitHub pull requests also 
reviewed regularly?

Thanks,

Luc


https://github.com/apache/lucene-solr/pull/108

o.a.l.queryparser.flexible.standard.StandardQueryParser cleanup

* Removed unused, but confusing code (CONJ_AND == CONJ_OR == 2 ???). 
Unfortunately, the code generated by JavaCC from the updated 
StandardSyntaxParser.jj differs in more places than necessary.
* Replaced Vector by List/ArrayList.
* Corrected the javadoc for StandardQueryParser.setLowercaseExpandedTerms

ant test in the queryparser directory runs successfully



https://github.com/apache/lucene-solr/pull/113

BitSet fixes

* `LongBitSet.ensureCapacity` overflows on `numBits > Integer.MaxValue`
* `Fixed-/LongBitSet`: Avoid conditional branch in `bits2words` (with a 
comment explaining the formula; see the sketch after this list)

TODO:
* Harmonize the use of `numWords` vs. `bits.length` vs. `numBits`
 * E.g.: `cardinality` scans up to `bits.length`, while `or` asserts on 
`index < numBits`
* If a `BitSet` is allocated with `n` bits, `ensureCapacity` with the same 
number `n` shouldn't grow the `BitSet`
 * Either both should allocate a larger array than really needed or neither.
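
A sketch of the branch-free word count referred to above (assuming 64-bit words as in FixedBitSet; this is an illustration, not the committed patch):

    public final class Bits2Words {
        // number of 64-bit words needed to hold numBits bits, without a conditional:
        // the arithmetic shift makes ((0 - 1) >> 6) == -1, so numBits == 0 yields 0.
        public static int bits2words(int numBits) {
            return ((numBits - 1) >> 6) + 1;
        }

        public static void main(String[] args) {
            System.out.println(bits2words(0));   // 0
            System.out.println(bits2words(1));   // 1
            System.out.println(bits2words(64));  // 1
            System.out.println(bits2words(65));  // 2
        }
    }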


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Cosmetic: Getting rid of an extra \n in TFIDFSimilarity.explainScore output

2014-11-26 Thread Vanlerberghe, Luc
TFIDFSimilarity.explainScore currently outputs an annoying (but harmless of 
course) extra \n.

It occurs because the freq argument is included as is in the description of the 
top Explain node,
whereas freq.getValue() is sufficient. The full freq Explain node is included 
as a detail further on anyway...

I attached a patch generated with git, but it's just:
-result.setDescription("score(doc="+doc+",freq="+freq+"), product of:");
+result.setDescription("score(doc="+doc+",freq="+freq.getValue()+"), 
product of:");

Output like this:

  <lst name="explain">
    <str name="0-764629">
5.5484066 = (MATCH) max of:
  5.5484066 = (MATCH) weight(titreSearch:camus in 4158) [DefaultSimilarity], 
result of:
5.5484066 = score(doc=4158,freq=1.0 = termFreq=1.0
), product of:
  0.60149205 = queryWeight, product of:
9.224405 = idf(docFreq=450, maxDocs=1682636)
0.065206595 = queryNorm
  9.224405 = fieldWeight in 4158, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
9.224405 = idf(docFreq=450, maxDocs=1682636)
1.0 = fieldNorm(doc=4158)
    </str>
  </lst>

becomes:

  <lst name="explain">
    <str name="0-764629">
5.5484066 = (MATCH) max of:
  5.5484066 = (MATCH) weight(titreSearch:camus in 4158) [DefaultSimilarity], 
result of:
5.5484066 = score(doc=4158,freq=1.0), product of:
  0.60149205 = queryWeight, product of:
9.224405 = idf(docFreq=450, maxDocs=1682636)
0.065206595 = queryNorm
  9.224405 = fieldWeight in 4158, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
9.224405 = idf(docFreq=450, maxDocs=1682636)
1.0 = fieldNorm(doc=4158)
    </str>
  </lst>



0001-Cleanup-of-TFIDFSimilarity.explainScore.patch
Description: 0001-Cleanup-of-TFIDFSimilarity.explainScore.patch

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

RE: Cosmetic: Getting rid of an extra \n in TFIDFSimilarity.explainScore output

2014-11-26 Thread Vanlerberghe, Luc
The freq explanation itself is still included as a detail a bit lower in the 
code (line 798 in my version), so no information gets lost!

See:
   1.0 = termFreq=1.0

Luc

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, November 26, 2014 16:59
To: Lucene/Solr dev; Vanlerberghe, Luc
Subject: Re: Cosmetic: Getting rid of an extra \n in 
TFIDFSimilarity.explainScore output

Thank you for the patch!  I agree that is annoying.

It makes me a little nervous, losing possibly important explanation
about how that freq itself was computed?

E.g. a PhraseQuery will have phraseFreq=X as the explanation for
that freq, telling you this wasn't just a simple term freq ... I
wonder whether other queries want to explain an interesting freq?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 26, 2014 at 10:33 AM, Vanlerberghe, Luc
luc.vanlerber...@bvdinfo.com wrote:
 TFIDFSimilarity.explainScore currently outputs an annoying (but harmless of 
 course) extra \n.

 It occurs because the freq argument is included as is in the description of 
 the top Explain node,
 whereas freq.getValue() is sufficient. The full freq Explain node is included 
 as a detail further on anyway...

 I attached a patch generated with git, but it's just:
 -result.setDescription("score(doc="+doc+",freq="+freq+"), product of:");
 +result.setDescription("score(doc="+doc+",freq="+freq.getValue()+"), 
 product of:");

 Output like this:

   <lst name="explain">
 <str name="0-764629">
 5.5484066 = (MATCH) max of:
   5.5484066 = (MATCH) weight(titreSearch:camus in 4158) [DefaultSimilarity], 
 result of:
 5.5484066 = score(doc=4158,freq=1.0 = termFreq=1.0
 ), product of:
   0.60149205 = queryWeight, product of:
 9.224405 = idf(docFreq=450, maxDocs=1682636)
 0.065206595 = queryNorm
   9.224405 = fieldWeight in 4158, product of:
 1.0 = tf(freq=1.0), with freq of:
   1.0 = termFreq=1.0
 9.224405 = idf(docFreq=450, maxDocs=1682636)
 1.0 = fieldNorm(doc=4158)
 </str>
   </lst>

 becomes:

   <lst name="explain">
 <str name="0-764629">
 5.5484066 = (MATCH) max of:
   5.5484066 = (MATCH) weight(titreSearch:camus in 4158) [DefaultSimilarity], 
 result of:
 5.5484066 = score(doc=4158,freq=1.0), product of:
   0.60149205 = queryWeight, product of:
 9.224405 = idf(docFreq=450, maxDocs=1682636)
 0.065206595 = queryNorm
   9.224405 = fieldWeight in 4158, product of:
 1.0 = tf(freq=1.0), with freq of:
   1.0 = termFreq=1.0
 9.224405 = idf(docFreq=450, maxDocs=1682636)
 1.0 = fieldNorm(doc=4158)
 </str>
   </lst>



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


Extending pagination using cursorMark

2014-04-10 Thread Vanlerberghe, Luc
In Solr 4.7 an exciting new feature was added that allows one to page through a 
complete result set without having to worry about missing or double results at 
page boundaries while keeping resource utilization low.
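
For reference, the basic cursorMark loop looks roughly like this (a sketch assuming the SolrJ 4.x client; the URL, sort field and row count are illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPagingExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(100);
            q.setSort("id", SolrQuery.ORDER.asc); // the sort must end on the uniqueKey field
            String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = server.query(q);
                // ... process rsp.getResults() ...
                String next = rsp.getNextCursorMark();
                done = cursorMark.equals(next); // an unchanged mark means the end of the results
                cursorMark = next;
            }
            server.shutdown();
        }
    }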

I have a common use case that has similar performance and consistency problems 
that could be solved by extending the way CursorMarks work:

A. The user executes a search and obtains thousands of results of which he sees 
the first 'page'.
   Apart from scrolling through the list he also has a scrollbar (or paging 
controls) to jump to anywhere in the list.
B. The user uses the scrollbar to jump to an arbitrary place in the list.
C. The user scrolls down a bit (but past the current 'page') to find what he's 
looking for.
D. The user realizes he's too far down and scrolls up a bit again (but before 
the current 'page' again...)

(Yes, I know that users should be educated to refine their search, but 
unfortunately the client for which the application is developed specifies 
that it should be possible to use it this way...)

For the moment this is implemented by using the start/rows parameters to get 
the appropriate 'page' and this has the disadvantages that cursorMark solves:
- Solr (actually I use Lucene directly, but that doesn't matter here) needs to 
store *all* documents up to document (start+rows) to be able to return just 
the rows requested. Except for step A (where start==0), this may be a huge 
performance hit.
- If the index is modified concurrently (especially when using NRT), jumping to 
the next/previous page can cause documents being repeated or skipped at page 
boundaries (as explained in 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results)

Here's the way an extension to the cursorMark system could solve the problem:
A. Solr/Lucene executes the search and returns the total number of hits and the 
requested number of top documents.
   start=0, rows=n, cursorMark=*
B. start=x, rows=n, cursorMark=*: Here Solr should allow combining both 
start!=0 and cursorMark=*. It should execute a normal request using start=x and 
rows=n and add two cursorMarks: one corresponding to the sort values of the 
first document and one corresponding to the sort values of the last document
C. Use cursorMark to get the 'next' pages: This is the same way cursorMark 
works for the moment:  the user passes the cursorMark corresponding to the sort 
values of the last document.
D. Use the cursorMark corresponding to the sort values of the first document to 
get the 'previous' pages.
In terms of implementing these changes, I've been looking at the source code 
and already did the easy ones :)
- If a cursorMark is passed (either cursorMark=* or a 'real' value), Solr 
should return two cursorMarks in the result: nextCursorMark as before and 
prevCursorMark corresponding to the sort values of the first document. Done.
- start!=0 and cursorMark=* should no longer be mutually exclusive (but 
start!=0 and cursorMark!=* should). Done.
- When returning a result using a cursorMark, the start value returned should 
correspond to the actual position of the first document in the full result set. 
 For the next page, this equals the number of documents skipped during 
processing, but unfortunately I didn't see a way (yet) to pass that information 
along everywhere.  This start value, together with the (possibly changed) 
numFound value can be used in the GUI to adjust the position of the scrollbar 
or the paging controls accordingly without having to estimate it.
- Implementing reverse paging could actually be easier than it sounds by 
internally reversing the sort order (really reversing, not just reversing 
ASC/DESC!) using the cursor as in the normal case and afterwards reversing the 
obtained list of documents.  I've updated PagingFieldCollector in 
TopFieldCollector.java by negating the values in reverseMul and overriding 
topDocs(start, howMany), but have to check everywhere partial results are 
merged as well...
- Implement a corresponding set of test cases for the paging-up case, matching 
those that exist for the paging-down case (help! :)

While working on the code, I thought of another use case as well: refreshing 
the current page:
Instead of passing the same start value again, the prevCursorMark could be 
passed, but with a hint that the document on or after this cursorMark should be 
returned.

Which brings me to the question of how to specify the new behavior to Solr 
without affecting the current behavior.

I propose that prevCursorMark and nextCursorMark simply encode the sort values 
for the first and last document (as nextCursorMark does now) and that a simple 
prefix is used when cursorMark should be used differently:
>: documents after the cursor position: use with nextCursorMark to get the 
next page of results
>=: documents after or on the cursor position: use with prevCursorMark to 
refresh the same page keeping the same sort position for the first document
<: documents before 

RE: [jira] [Commented] (LUCENENET-484) Some possibly major tests intermittently fail

2012-06-01 Thread Vanlerberghe, Luc
You're welcome.

By the way, the TestInsanity1 test could be forced to fail on every run by 
inserting a GC.Collect() call just before the CheckInsanity call.
I didn't patch the TestCase itself since it's ported from java code...

Luc

-Original Message-
From: Christopher Currens (JIRA) [mailto:j...@apache.org] 
Sent: Thursday, May 31, 2012 18:14
To: lucene-net-dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENENET-484) Some possibly major tests 
intermittently fail


[ 
https://issues.apache.org/jira/browse/LUCENENET-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286699#comment-13286699
 ] 

Christopher Currens commented on LUCENENET-484:
---

Thanks Luc.  This is great stuff.  I'll run the patch on my local box and 
double check everything.  Your help with this is appreciated by all of us!

 Some possibly major tests intermittently fail
 --

 Key: LUCENENET-484
 URL: https://issues.apache.org/jira/browse/LUCENENET-484
 Project: Lucene.Net
  Issue Type: Bug
  Components: Lucene.Net Core, Lucene.Net Test
Affects Versions: Lucene.Net 3.0.3
Reporter: Christopher Currens
 Fix For: Lucene.Net 3.0.3

 Attachments: Lucenenet-484-WeakDictionary.patch, 
 Lucenenet-484-WeakDictionaryTests.patch


 These tests will fail intermittently in Debug or Release mode, in the core 
 test suite:
 # -Lucene.Net.Index:-
 #- -TestConcurrentMergeScheduler.TestFlushExceptions-
 # Lucene.Net.Store:
 #- TestLockFactory.TestStressLocks
 # Lucene.Net.Search:
 #- TestSort.TestParallelMultiSort
 # Lucene.Net.Util:
 #- TestFieldCacheSanityChecker.TestInsanity1
 #- TestFieldCacheSanityChecker.TestInsanity2
 #- (It's possible all of the insanity tests fail at one point or another)
 # Lucene.Net.Support
 #- TestWeakHashTableMultiThreadAccess.Test
 TestWeakHashTableMultiThreadAccess should be fine to remove along with the 
 WeakHashTable in the Support namespace, since it's been replaced with 
 WeakDictionary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




RE: Lucene 2.0

2006-05-22 Thread Vanlerberghe, Luc
Is there a chance that LUCENE-485 makes it into 2.0?  There's a patch
attached and Doug added a +1 in the comments...

It reduces the time IndexWriter keeps the commit lock while cleaning up
obsolete files/segments.
The commit lock should be held while the 'working set' is changing, but
is not necessary during cleanup.

Thanks, Luc


-Original Message-
From: Chuck Williams [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 19, 2006 1:23
To: java-dev@lucene.apache.org
Subject: Re: Lucene 2.0

I think Lucene-561 is in the egregious category and it has a patch to
fix it (be sure to get the most recent of the two).  Can this be
included?

Chuck

Yonik Seeley wrote on 05/18/2006 10:50 AM:
 On 5/18/06, DM Smith [EMAIL PROTECTED] wrote:
  at the moment, there are two Jira issues with a Fix version of
 2.0 still
  unresolved: LUCENE-556 and LUCENE-546

 I wouldn't mind seeing 415 being fixed, but I seem to be missing a way one
 changes Fix Version.

 -Yonik
 http://incubator.apache.org/solr Solr, the open-source Lucene search
 server

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: weird behavior of IndexReader.indexExists()

2006-05-09 Thread Vanlerberghe, Luc
Make sure both instances are using the same lock directory.
The segments file should only be read or written while holding the
commit lock.

If the lock directories don't match, you'll get more 'strange' errors...

In Lucene 1.4.2 some methods did not use the lock; this was patched
a couple of months ago.

Luc

-Original Message-
From: Andy Hind [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 9, 2006 13:47
To: java-dev@lucene.apache.org
Subject: RE: weird behavior of IndexReader.indexExists()

Hi

I think I have discovered this too.
It is on my list of issues to raise  

The index-exists test looks for the segments file.
When the index is committing, and you are unlucky, this file may not be
found as the new segments file replaces the old one. The result is the
index appears not to exist.

Regards

Andy

-Original Message-
From: wenjie zheng [mailto:[EMAIL PROTECTED] 
Sent: 08 May 2006 18:57
To: java-dev@lucene.apache.org
Subject: Re: weird behavior of IndexReader.indexExists()

This happens sometimes when the number of docs is over 2000, so it's kind of
random.

Wenjie

On 5/8/06, wenjie zheng [EMAIL PROTECTED] wrote:

 I created an index with more than 30,000 text files.
 I used indexExists() to determine either to create a new index or to
add
 docs to the existing index.

 But when the num of docs in the index was over 3,000 (sometimes 3,400,
 sometimes 3,200), the indexExists function returns false, so I ended
up
 recreating a new index.

 Here is my code:
 If the index exists, we will add files to it; otherwise, we create a new
index.
 In either case, an IndexingThread will be spawned to do that.
  if(IndexReader.indexExists(indexDir)){
 logger.info("Working on existing index ...");
 IndexingThread.startIndexingThread(Username, new
 File(propsFile), new File(indexDir), docs,
   new StandardAnalyzer(), false);
   }else{
 logger.info("Create a new index ...");
 IndexingThread.startIndexingThread(Username, new
 File(propsFile), new File(indexDir), docs,
new StandardAnalyzer(), true);
}


 inside the startIndexingThread function, I am calling the following
 function to add files to the index:
 /**
  * Add an array of Files to an index
  *
  * @param propsFile the properties file
  * @param indexDir  the folder where index files will be created
in
  * @param docs  an array of Files to be add to the index
  * @param analyzer  any Analyzer object
  */
 public void addFiles(File propsFile, File indexDir, File[] docs,
 Analyzer analyzer, boolean overwrite) throws Exception {
 Properties props = new Properties(new
FileInputStream(propsFile));

 if(overwrite || IndexReader.indexExists(indexDir)){ //either
 overwrite or working on an existing index
 Directory index = FSDirectory.getDirectory(indexDir,
 overwrite);
 IndexWriter writer = new IndexWriter(index, analyzer,
 overwrite);

 FileIndexer indexer = new FileIndexer(props);

 long start = new Date().getTime();
 indexer.index(writer, docs);
 writer.optimize();
 writer.close (); // close the writer
 index.close();  // close the index Directory
 long end = new Date().getTime();

 logger.info("Total time: " + (end - start) + " ms");

 }else{
 logger.error("Index files are not found: " +
 indexDir.getAbsolutePath() + ", overwrite = false");
 }
 }

 Thanks,
 Wenjie



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: GData, updateable IndexSearcher

2006-04-28 Thread Vanlerberghe, Luc
Here are some remarks from what I learned by inspecting the code (quite
a while ago now, but the principle shouldn't have changed)...

When an IndexReader opens the segments of an index it 
- grabs the commit lock, 
- reads the segments file for the list of segment names.
- opens the files for each segment (except the .del one), 
- *loads* the .del files associated with each segment (if present) and
then 
- releases the commit lock.

The segment files never change, and the .del files are loaded in memory
so an open IndexReader will always have the same view of its segments,
even if the .del files are changed by an other IndexReader.

So if you want to implement reopen() of a segment, you should be fine by
just reloading the .del file in memory for that segment (while holding
the commit lock of course).
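
A purely hypothetical sketch of that per-segment reload (the Segment type, loadDeletions() and commitLock below are placeholders I made up, not real Lucene internals):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Placeholder for a per-segment reader; only the deletions get reloaded.
    interface Segment {
        void loadDeletions() throws IOException; // re-read the .del file into memory
    }

    final class ReopenSketch {
        private final Object commitLock = new Object();          // stands in for the index commit lock
        private final List<Segment> segments = new ArrayList<Segment>();

        void reopenDeletions() throws IOException {
            synchronized (commitLock) {                          // same lock held when opening the index
                for (Segment s : segments) {
                    s.loadDeletions();                           // segment data files themselves never change
                }
            }
        }
    }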

Luc

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 27, 2006 20:30
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: GData, updateable IndexSearcher

On 4/27/06, Robert Engels [EMAIL PROTECTED] wrote:
 I thought each segment maintained its own list of deleted documents

Right.

 (since segments are WRITE ONCE

Yes, but deletions are the exception to that rule.  Once written,
segment files never change, except for the file that tracks deleted
documents for that segment.

Hence, if the segment name is the same, you should be able to count on
everything being unchanged *except* for which documents are deleted.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search
server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [EMAIL PROTECTED]: Project lucene-java (in module lucene-java) failed

2005-11-18 Thread Vanlerberghe, Luc
Yes, but that's only the source code level you set (so you won't be able
to use generics or the "for each" construct if you set 1.4).
source and target are both set in common-build.xml.

You should also set the library to use to 1.4...

I made the same mistake at first too.

Luc

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 18, 2005 8:36
To: java-dev@lucene.apache.org
Subject: Re: [EMAIL PROTECTED]: Project lucene-java (in module lucene-java)
failed


: It's fixed now.
: Sorry about that... I've already set up a test script to switch my JDK
: to 1.4 before running ant test.

I don't remember the specifics, but isn't there an attribute for the ant
javac task that you can use to tell it whether you want it to compile as
1.4 code or 1.5 code? ... I thought setting that option to 1.4 worked even
if you had a newer compiler (telling the compiler to use backwards
compatible mode)

?


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [Performance] Streaming main memory indexing of single strings

2005-04-20 Thread Vanlerberghe, Luc
One reason to choose the 'simplistic IndexReader' approach to this
problem over regexes is that the result should be 'bug-compatible' with
a standard search over all documents.

Differences between the two systems would be difficult to explain to an
end-user (let alone for the developer to debug and find the reason in
the first place!)

Luc

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Saturday, April 16, 2005 2:09 AM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings

On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
 Cool! For my use case it would need to be able to handle arbitrary 
 queries (previously parsed from a general lucene query string).
 Something like:

   float match(String Text, Query query)

 it's fine with me if it also works for

   float[] match(String[] texts, Query query) or
   float(Document doc, Query query)

 but that isn't required by the use case.

My implementation is nearly that.  The score is available as
hits.score(0).  You would also need an analyzer, I presume, passed to
your proposed match() method if you want the text broken into terms.  
My current implementation is passed a String[] where each item is
considered a term for the document.  match() would also need a field
name to be fully accurate - since the analyzer needs a field name and
terms used for searching need a field name.  The Query may contain terms
for any number of fields - how should that be handled?  Should only a
single field name be passed in and any terms request for other fields be
ignored?  Or should this utility morph to assume any words in the text
is in any field being asked of it?

As for Doug's devil's advocate questions - I really don't know what I'd
use it for personally (other than the "match this single string against
a bunch of queries" case), I just thought it was clever that it could be
done.  Clever regex's could come close, but it'd be a lot more effort
than reusing good ol' QueryParser and this simplistic IndexReader, along
with an Analyzer.

Erik


 Wolfgang.

 I am intrigued by this and decided to mock a quick and dirty example 
 of such an IndexReader.  After a little trial-and-error I got it 
 working at least for TermQuery and WildcardQuery.  I've pasted my 
 code below as an example, but there is much room for improvement, 
 especially in terms of performance and also in keeping track of term 
 frequency, and also it would be nicer if it handled the analysis 
 internally.

 I think something like this would make a handy addition to our 
 contrib area at least.  I'd be happy to receive improvements to this 
 and then add it to a contrib subproject.

 Perhaps this would be a handy way to handle situations where users 
 have queries saved in a system and need to be alerted whenever a new 
 document arrives matching the saved queries?

  Erik





 -Original Message-
 From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 14, 2005 4:04 PM
 To: java-dev@lucene.apache.org
 Subject: Re: [Performance] Streaming main memory indexing of single 
 strings


 This seems to be a promising avenue worth exploring. My gut feeling 
 is that this could easily be 10-100 times faster.

 The drawback is that it requires a fair amount of understanding of 
 intricate Lucene internals, pulling those pieces together and 
 adapting them as required for the seemingly simple float 
 match(String text, Query query).

 I might give it a shot but I'm not sure I'll be able to pull this 
 off!
 Is there any similar code I could look at as a starting point?

 Wolfgang.

 On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:

 I think you are not approaching this the correct way.

 Pseudo code:

 Subclass IndexReader.

 Get tokens from String 'document' using Lucene analyzers.

 Build simple hash-map based data structures using tokens for terms,

 and term positions.

 reimplement termDocs() and termPositions() to use the structures 
 from above.

 run searches.

 start again with next document.



 -Original Message-
 From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 14, 2005 2:56 PM
 To: java-dev@lucene.apache.org
 Subject: Re: [Performance] Streaming main memory indexing of single

 strings


 Otis, this might be a misunderstanding.

 - I'm not calling optimize(). That piece is commented out you if 
 look again at the code.
 - The *streaming* use case requires that for each query I add one 
 (and only one) document (aka string) to an empty index:

 repeat N times (where N is millions or billions):
add a single string (aka document) to an empty index
query the index
   drop index (or delete its document)

 with the following API being called N times: float match(String 
 text, Query query)
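
For what it's worth, a minimal sketch of that match() API using the MemoryIndex class from Lucene's contrib/memory module, which covers exactly this single-document, in-memory use case (the field name and analyzer choice are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    public class SingleStringMatcher {
        private final StandardAnalyzer analyzer = new StandardAnalyzer();

        public float match(String text, Query query) {
            MemoryIndex index = new MemoryIndex();      // an empty, single-document, in-memory index
            index.addField("content", text, analyzer);  // analyze and index the string
            return index.search(query);                 // a score > 0.0f means the query matched
        }
    }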

 So there's no possibility of adding many documents and thereafter 
 running the query. This in turn seems to mean that the IndexWriter 
 can't be kept open -