Re: [VOTE] Release Lucene/Solr 8.5.2 RC1

2020-05-24 Thread Doron Cohen
+1  SUCCESS! [1:29:21.990727] - GNU/Linux Ubuntu 4.4.0-174-generic #204-Ubuntu

-
At the same time it failed on Windows10 - WSL.
Tried several times, always failing on test-lock-factory, with a message
like:
"lockStressTestN: IllegalStateException: id 1 got lock, but 2 already
holds the lock"
I assume it is a problem with my setup, as I don't yet fully trust that
Cygwin/Win/WSL combination.
So I'm not considering this a real failure; still, it seems worth mentioning,
in case others have tried a similar setup.

On Sat, 23 May 2020 at 10:39, Shalin Shekhar Mangar 
wrote:

> +1
>
> SUCCESS! [0:47:23.934909]
>
> On Wed, May 20, 2020 at 11:28 PM Mike Drob  wrote:
>
>> Devs,
>>
>> Please vote for release candidate 1 for Lucene/Solr 8.5.2
>>
>> The artifacts can be downloaded from:
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.5.2-RC1-rev384dadd9141cec3f848d8c416315dc2384749814
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.5.2-RC1-rev384dadd9141cec3f848d8c416315dc2384749814
>>
>> The vote will be open until 2020-05-26 18:00 UTC (extended deadline due
>> to multiple holidays in the next 72 hours)
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1 (non-binding)
>>
>> Mike
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


4.0 and 4.1 FieldCacheImpl.DocTermsImpl.exists(docid) possibly broken

2013-07-18 Thread Doron Cohen
Hi, just an FYI that may be helpful for anyone obliged to use 4.0.0 or 4.1.0:
it seems that this method actually does the opposite of its intention.

I did not find mentions of this in the lists or elsewhere.

This is the code for o.a.l.search.FieldCacheImpl.DocTermsImpl.exists(int):
public boolean exists(int docID) {
  return docToOffset.get(docID) == 0;
}

Its description says: "Returns true if this doc has this field and is not
deleted".
But it returns true for docs not containing the field and false for those
that do contain it.
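Assuming the underlying docToOffset structure stores offset 0 for documents that lack the field (which the observed behavior suggests), the fix would simply invert the comparison. The following is an illustrative stand-alone sketch, not the real FieldCacheImpl internals:

```java
// Illustrative sketch only: 'docToOffset' here is a plain array standing in
// for FieldCacheImpl's packed docToOffset structure, where offset 0 marks
// a document that does not have the field.
public class ExistsSketch {
    // only doc 1 has the field (non-zero offset into the terms bytes)
    static final int[] docToOffset = {0, 7, 0, 0};

    // the 4.0/4.1 code: returns the opposite of its documented intent
    static boolean existsBuggy(int docID) {
        return docToOffset[docID] == 0;
    }

    // the presumably intended check
    static boolean existsFixed(int docID) {
        return docToOffset[docID] != 0;
    }

    public static void main(String[] args) {
        for (int d = 0; d < docToOffset.length; d++) {
            System.out.println(d + " buggy=" + existsBuggy(d)
                + " fixed=" + existsFixed(d));
        }
    }
}
```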

A simple workaround is not to call this method before calling getTerm(), but
rather to rely on getTerm()'s documented behavior: "... returns the same
BytesRef, or an empty (length=0) BytesRef if the doc did not have this field
or was deleted."

So usage code can be like this:
DocTerms values = FieldCache.DEFAULT.getTerms(reader, FIELD_NAME);
BytesRef term = new BytesRef();
for (int docid = 0; docid < reader.maxDoc(); docid++) {
  term = values.getTerm(docid, term);
  if (term.length > 0) {
    doSomethingWith(term.utf8ToString());
  }
}
FieldCache.DEFAULT.purge(reader);

I am not sure about the overhead of this compared to first checking
exists(), but it at least works correctly.

The code for exists() was as above until R1442497 (
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?revision=1442497&view=markup)
and then in R1443717 (
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?r1=1442497&r2=1443717&diff_format=h)
the API was changed as part of LUCENE-4547 (DocValues improvements), which
was included in 4.2.

Simple code to demonstrate this (here with 4.1 but same results with 4.0):

RAMDirectory d = new RAMDirectory();
IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_41,
    new SimpleAnalyzer(Version.LUCENE_41)));
w.addDocument(new Document()); // Empty doc (0, 0)
Document doc = new Document(); // Real doc (1, 1)
doc.add(new StringField("f1", "v1", Store.NO));
w.addDocument(doc);
w.addDocument(new Document()); // Empty doc (2, 2)
w.addDocument(new Document()); // Empty doc (3, 3)
w.commit(); // Commit - so we'll have two atomic readers
doc = new Document(); // Real doc (0, 4)
doc.add(new StringField("f1", "v2", Store.NO));
w.addDocument(doc);
w.addDocument(new Document()); // Empty doc (1, 5)
w.close();

IndexReader r = DirectoryReader.open(d);
BytesRef br = new BytesRef();
for (AtomicReaderContext leaf : r.leaves()) {
  System.out.println("--- new atomic reader");
  AtomicReader reader = leaf.reader();
  DocTerms a = FieldCache.DEFAULT.getTerms(reader, "f1");
  for (int i = 0; i < reader.maxDoc(); ++i) {
    int n = leaf.docBase + i;
    System.out.println(n + " exists: " + a.exists(i));
    br = a.getTerm(i, br);
    if (br.length > 0) {
      System.out.println(n + "  " + br.utf8ToString());
    }
  }
}

The result printing:

  --- new atomic reader
  0 exists: true
  1 exists: false
  1  v1
  2 exists: true
  3 exists: true
  --- new atomic reader
  4 exists: false
  4  v2
  5 exists: true

Indeed, exists() results are wrong.

So again, just an FYI, as this API no longer exists...

Regards,
Doron


[jira] [Resolved] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-10 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-4590.
-

Resolution: Fixed

done.

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4590.patch
>
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Reopened] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-10 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reopened LUCENE-4590:
-

Lucene Fields:   (was: New)

Reopening the issue to make the categories file name method,
categoriesLineFile(), public, so that it can easily be modified in the future
without breaking application logic.

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4590.patch
>
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.




Re: commit message format for tag bot

2012-12-09 Thread Doron Cohen
Thanks Mark, this bot is very helpful, and now even more so!


On Sun, Dec 9, 2012 at 5:37 PM, Mark Miller  wrote:

> Okay, I've made the following changes:
>
> 1. Doesn't look for the : after an issue id anymore, so LUCENE-101 is fine
> as well as LUCENE-101:
>
> 2. If there are multiple issue id's, each one is tagged.
>
> 3. Msg no longer cuts off what came before the first issue id.
>
> - Mark
>
> On Dec 9, 2012, at 10:13 AM, Uwe Schindler  wrote:
>
> > Thank you. It is nice that we have that bot, if we can improve it, it
> will get better!
> >
> > Uwe
> >
> >
> >
> > Mark Miller  schrieb:
> > I'll look at making some mods with this feedback.
> >
> > - Mark
> >
> > On Dec 9, 2012, at 5:35 AM, Uwe Schindler  wrote:
> >
> > Hi,
> >
> > One other thing is: The bot does not take the full commit message, e.g.
> if it spawns multiple lines, it misses the later ones. Also it does not add
> the merge messages my client generally adds, e.g.:
> >
> > Merged revision(s) 1417694 from lucene/dev/trunk:
> > LUCENE-4589: Upgraded benchmark module's Nekohtml dependency to version
> 1.9.17, removing the workaround in Lucene's HTML parser for the Turkish
> locale
> >
> > The first line is missing. This may not be important but I don’t like my
> commit message be stripped down and possibly important parts left out. I
> generally try to format the
> > commit messages, sometimes with new paragraphs and so on.
> >
> >
> > Also it seems to have problems with Multi-Issue commits.
> >
> > In my opinion, the bot should copy the *whole* commit message to all
> issue-ids listed anywhere in the message (regex /(LUCENE|SOLR)\-\d+/) and
> *not* try to parse the message.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > From: Doron Cohen [mailto:cdor...@gmail.com]
> > Sent: Sunday, December 09, 2012 11:26 AM
> > To: dev@lucene.apache.org
> > Subject: commit message format for tag bot
> >
> > Hi,
> >
> > It is great that the tag bot is adding back those commit version numbers
> to JIRA.
> >
> > Just noticed the tag bot did not take a message formatted like this:
> > LUCENE-4588 (cont): some text
> >
> > Assuming it would take multiple messages per issue, it probably accepts
> only this
> > format:
> >
> > LUCENE-4588: (cont) - some text
> >
> > So I'll format the message "correctly" next time, and perhaps add the
> two missing continuation commits manually (needed to know them for merging
> to 4x).
> >
> > Just thought I'll share this with others..
> >
> > Doron
> >
> >
> >
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >
> > --
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, 28213 Bremen
> > http://www.thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Resolved] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-09 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-4590.
-

Resolution: Fixed

Done.

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4590.patch
>
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.




[jira] [Resolved] (LUCENE-4595) EnwikiContentSource thread safety problem (NPE) in 'forever' mode

2012-12-09 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-4595.
-

   Resolution: Fixed
Lucene Fields:   (was: New)

Fixed.

Seems the tag bot missed the trunk commit for this one,
so here they are both:

- trunk: [r1418281|http://svn.apache.org/viewvc?view=revision&revision=1418281]
- 4x: [r1418925|http://svn.apache.org/viewvc?view=revision&revision=1418925]

> EnwikiContentSource thread safety problem (NPE) in 'forever' mode
> -
>
> Key: LUCENE-4595
> URL: https://issues.apache.org/jira/browse/LUCENE-4595
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4595.patch
>
>
> If close() is invoked around when an additional input stream reader is 
> recreated for the 'forever' behavior, an uncaught NPE might occur.
> This bug was probably always there, just exposed now with the 
> EnwikiContentSourceTest added in LUCENE-4588.




[jira] [Resolved] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-09 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-4588.
-

   Resolution: Fixed
Lucene Fields:   (was: New)

Fixed.

As a side note, merging benchmark changes to 4x is so much easier than it used 
to be in 3x, now that trunk and branch are structured the same! Now if only 
'precommit' would run 60 times faster (that would be 12 seconds here)... 
wouldn't that be great? :) 

> EnwikiContentSource silently swallows the last wiki doc
> ---
>
> Key: LUCENE-4588
> URL: https://issues.apache.org/jira/browse/LUCENE-4588
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4588.patch
>
>
> Last wiki doc is never returned




[jira] [Commented] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527399#comment-13527399
 ] 

Doron Cohen commented on LUCENE-4588:
-

Two more commits to trunk (uncaught by bot due to incorrect message format):
- [r1417871|http://svn.apache.org/viewvc?rev=1417871&view=rev] -- LUCENE-4588 
(cont): (EnwikiContentSource fixes) avoid using the forbidden
StringBufferInputStream..
- [r1417921|http://svn.apache.org/viewvc?rev=1417921&view=rev] -- LUCENE-4588 
(cont): simplify test input stream creation. 

> EnwikiContentSource silently swallows the last wiki doc
> ---
>
> Key: LUCENE-4588
> URL: https://issues.apache.org/jira/browse/LUCENE-4588
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4588.patch
>
>
> Last wiki doc is never returned




commit message format for tag bot

2012-12-09 Thread Doron Cohen
Hi,

It is great that the tag bot is adding back those commit version numbers to
JIRA.

Just noticed the tag bot did not take a message formatted like this:
  LUCENE-4588 (cont): some text

Assuming it would take multiple messages per issue, it probably accepts
only this format:
  LUCENE-4588: (cont) - some text

So I'll format the message "correctly" next time, and perhaps add the two
missing continuation commits manually (I needed to know them for merging to
4x).

Just thought I'd share this with others.

Doron
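As a sketch of what a more tolerant bot could do — simply collecting every issue id in the message rather than parsing a fixed format, as was also suggested on the list — the matching might look like this (class and method names here are made up for the example, not the actual tag bot code):

```java
// Hypothetical sketch, not the actual tag bot: collect every LUCENE-nnn /
// SOLR-nnn id anywhere in a commit message, with or without a trailing colon.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IssueIdExtractor {
    private static final Pattern ISSUE_ID = Pattern.compile("(LUCENE|SOLR)-\\d+");

    static List<String> extract(String commitMessage) {
        List<String> ids = new ArrayList<>();
        Matcher m = ISSUE_ID.matcher(commitMessage);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // both message formats discussed here would match
        System.out.println(extract("LUCENE-4588 (cont): some text"));   // [LUCENE-4588]
        System.out.println(extract("LUCENE-4588: (cont) - some text")); // [LUCENE-4588]
    }
}
```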


Re: lost entries in trunk/lucene/CHANGES.txt

2012-12-09 Thread Doron Cohen
done.


On Sun, Dec 9, 2012 at 10:29 AM, Doron Cohen  wrote:

> Hi, seems some entries were lost when committing LUCENE-4585 (Spatial
> PrefixTree based Strategies).
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/CHANGES.txt?r1=1418005&r2=1418006&pathrev=1418006&view=diff
> I think I'll just add them back...
> Doron
>


lost entries in trunk/lucene/CHANGES.txt

2012-12-09 Thread Doron Cohen
Hi, seems some entries were lost when committing LUCENE-4585 (Spatial
PrefixTree based Strategies).
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/CHANGES.txt?r1=1418005&r2=1418006&pathrev=1418006&view=diff
I think I'll just add them back...
Doron


[jira] [Commented] (LUCENE-4595) EnwikiContentSource thread safety problem (NPE) in 'forever' mode

2012-12-07 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526326#comment-13526326
 ] 

Doron Cohen commented on LUCENE-4595:
-

Thanks for verifying, Robert.
Committed the fix, let's see if the build becomes stable again.
Issue remains open for porting to 4x.

> EnwikiContentSource thread safety problem (NPE) in 'forever' mode
> -
>
> Key: LUCENE-4595
> URL: https://issues.apache.org/jira/browse/LUCENE-4595
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4595.patch
>
>
> If close() is invoked around when an additional input stream reader is 
> recreated for the 'forever' behavior, an uncaught NPE might occur.
> This bug was probably always there, just exposed now with the 
> EnwikiContentSourceTest added in LUCENE-4588.




[jira] [Updated] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-4590:


Attachment: LUCENE-4590.patch

Patch with the new task and a test.

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4590.patch
>
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.




[jira] [Commented] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13514649#comment-13514649
 ] 

Doron Cohen commented on LUCENE-4590:
-

Now I see what you mean. Spooky, it is as if you were looking into the patch I 
did not post here... How did you know I chose not to modify EnwikiContentSource?

I agree that if someone wishes to index just the non-category pages, the new 
WriteEnwikiLineDoc would create the category pages file for no use. Also, if 
indexing is conducted straight away, not through a line file first, categories 
will be indexed. But then anyone could check the title and decide not to index 
those docs. So I see the advantage; I'm just not tempted to add this at the 
moment, but it can be added.

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.




[jira] [Commented] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13514644#comment-13514644
 ] 

Doron Cohen commented on LUCENE-4588:
-

Thanks for the review Shai, changed as you suggested and committed (while jira 
was down...)

> EnwikiContentSource silently swallows the last wiki doc
> ---
>
> Key: LUCENE-4588
> URL: https://issues.apache.org/jira/browse/LUCENE-4588
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4588.patch
>
>
> Last wiki doc is never returned




[jira] [Updated] (LUCENE-4595) EnwikiContentSource thread safety problem (NPE) in 'forever' mode

2012-12-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-4595:


Attachment: LUCENE-4595.patch

Patch that is supposed to fix this.
However, I was not able to recreate the bug, so I couldn't actually verify the fix.
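The failure described (close() racing the reader recreation in 'forever' mode) is a classic check-then-act race. The following is a purely illustrative sketch of guarding such a recreation — not the actual EnwikiContentSource code or the committed patch:

```java
// Purely illustrative sketch of the race pattern: a parser thread re-opens
// its input in 'forever' mode while close() may run concurrently.
// Synchronizing the check-then-act avoids wrapping a nulled-out stream.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ForeverSource {
    private InputStream is;
    private boolean closed = false;

    // stand-in for re-opening the wiki dump in 'forever' mode
    private InputStream openInput() {
        return new ByteArrayInputStream("<page/>".getBytes(StandardCharsets.UTF_8));
    }

    // called by the parser thread when it reaches end of input;
    // the synchronized check prevents the NPE seen in the stack traces
    synchronized Reader recreateReader() {
        if (closed) {
            return null; // caller must handle shutdown instead of NPE-ing
        }
        is = openInput();
        return new InputStreamReader(is, StandardCharsets.UTF_8);
    }

    synchronized void close() {
        closed = true;
        is = null;
    }

    public static void main(String[] args) {
        ForeverSource s = new ForeverSource();
        System.out.println("reader before close: " + (s.recreateReader() != null));
        s.close();
        System.out.println("reader after close:  " + (s.recreateReader() == null));
    }
}
```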

> EnwikiContentSource thread safety problem (NPE) in 'forever' mode
> -
>
> Key: LUCENE-4595
> URL: https://issues.apache.org/jira/browse/LUCENE-4595
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4595.patch
>
>
> If close() is invoked around when an additional input stream reader is 
> recreated for the 'forever' behavior, an uncaught NPE might occur.
> This bug was probably always there, just exposed now with the 
> EnwikiContentSourceTest added in LUCENE-4588.




Re: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.7.0_09) - Build # 2076 - Still Failing!

2012-12-06 Thread Doron Cohen
Created LUCENE-4595 for this.


On Thu, Dec 6, 2012 at 6:54 PM, Policeman Jenkins Server <
jenk...@sd-datasolutions.de> wrote:

> Build:
> http://jenkins.sd-datasolutions.de/job/Lucene-Solr-trunk-Windows/2076/
> Java: 32bit/jdk1.7.0_09 -server -XX:+UseConcMarkSweepGC
>
> 1 tests failed.
> REGRESSION:
>  org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSourceTest.testForever
>
> Error Message:
> Captured an uncaught exception in thread: Thread[id=400, name=Thread-189,
> state=RUNNABLE, group=TGRP-EnwikiContentSourceTest]
>
> Stack Trace:
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
> uncaught exception in thread: Thread[id=400, name=Thread-189,
> state=RUNNABLE, group=TGRP-EnwikiContentSourceTest]
> at
> __randomizedtesting.SeedInfo.seed([59885A060E0846A6:1DF2E4FD80113111]:0)
> Caused by: java.lang.NullPointerException
> at __randomizedtesting.SeedInfo.seed([59885A060E0846A6]:0)
> at java.io.Reader.<init>(Reader.java:78)
> at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
> at
> org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:186)
> at java.lang.Thread.run(Thread.java:722)
>
>
>
>
> Build Log:
> [...truncated 5909 lines...]
> [junit4:junit4] Suite:
> org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSourceTest
> [junit4:junit4]   2> Diċ 06, 2012 8:54:42 WN
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
> uncaughtException
> [junit4:junit4]   2> WARNING: Uncaught exception in thread:
> Thread[Thread-189,5,TGRP-EnwikiContentSourceTest]
> [junit4:junit4]   2> java.lang.NullPointerException
> [junit4:junit4]   2>at
> __randomizedtesting.SeedInfo.seed([59885A060E0846A6]:0)
> [junit4:junit4]   2>at java.io.Reader.<init>(Reader.java:78)
> [junit4:junit4]   2>at
> java.io.InputStreamReader.<init>(InputStreamReader.java:129)
> [junit4:junit4]   2>at
> org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:186)
> [junit4:junit4]   2>at java.lang.Thread.run(Thread.java:722)
> [junit4:junit4]   2>
> [junit4:junit4]   2> NOTE: reproduce with: ant test
>  -Dtestcase=EnwikiContentSourceTest -Dtests.method=testForever
> -Dtests.seed=59885A060E0846A6 -Dtests.slow=true -Dtests.locale=mt_MT
> -Dtests.timezone=Indian/Mahe -Dtests.file.encoding=US-ASCII
> [junit4:junit4] ERROR   0.35s | EnwikiContentSourceTest.testForever <<<
> [junit4:junit4]> Throwable #1:
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
> uncaught exception in thread: Thread[id=400, name=Thread-189,
> state=RUNNABLE, group=TGRP-EnwikiContentSourceTest]
> [junit4:junit4]>at
> __randomizedtesting.SeedInfo.seed([59885A060E0846A6:1DF2E4FD80113111]:0)
> [junit4:junit4]> Caused by: java.lang.NullPointerException
> [junit4:junit4]>at
> __randomizedtesting.SeedInfo.seed([59885A060E0846A6]:0)
> [junit4:junit4]>at java.io.Reader.<init>(Reader.java:78)
> [junit4:junit4]>at
> java.io.InputStreamReader.<init>(InputStreamReader.java:129)
> [junit4:junit4]>at
> org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:186)
> [junit4:junit4]>at java.lang.Thread.run(Thread.java:722)
> [junit4:junit4]   2> NOTE: test params are: codec=SimpleText,
> sim=DefaultSimilarity, locale=mt_MT, timezone=Indian/Mahe
> [junit4:junit4]   2> NOTE: Windows 7 6.1 x86/Oracle Corporation 1.7.0_09
> (32-bit)/cpus=2,threads=1,free=124838400,total=145285120
> [junit4:junit4]   2> NOTE: All tests run in this JVM:
> [SearchWithSortTaskTest, TestConfig, TestHtmlParser, CreateIndexTaskTest,
> AddIndexesTaskTest, StreamUtilsTest, TrecContentSourceTest,
> TestPerfTasksLogic, TestQualityRun, LineDocSourceTest, TestPerfTasksParse,
> WriteLineDocTaskTest, DocMakerTest, PerfTaskTest, AltPackageTaskTest,
> EnwikiContentSourceTest]
> [junit4:junit4] Completed in 0.39s, 3 tests, 1 error <<< FAILURES!
>
> [...truncated 9 lines...]
> BUILD FAILED
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:335:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:39:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\build.xml:520:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\common-build.xml:1696:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\module-build.xml:61:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\common-build.xml:1167:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\common-build.xml:831:
> There were test failures.

[jira] [Commented] (LUCENE-4595) EnwikiContentSource thread safety problem (NPE) in 'forever' mode

2012-12-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512113#comment-13512113
 ] 

Doron Cohen commented on LUCENE-4595:
-

Jenkins' reproduce params and error log: 
{noformat}
Build: http://jenkins.sd-datasolutions.de/job/Lucene-Solr-trunk-Linux/3093/
Java: 32bit/jdk1.6.0_37 -server -XX:+UseSerialGC

1 tests failed.
FAILED:  
org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSourceTest.testForever

Error Message:
Captured an uncaught exception in thread: Thread[id=140, name=Thread-2, 
state=RUNNABLE, group=TGRP-EnwikiContentSourceTest]

Stack Trace:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=140, name=Thread-2, state=RUNNABLE, 
group=TGRP-EnwikiContentSourceTest]
at 
__randomizedtesting.SeedInfo.seed([EF7AF10441351C3B:AB004FFFCF2C6B8C]:0)
Caused by: java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([EF7AF10441351C3B]:0)
at java.io.Reader.<init>(Reader.java:61)
at java.io.InputStreamReader.<init>(InputStreamReader.java:112)
at 
org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:186)
at java.lang.Thread.run(Thread.java:662)

Build Log:
[...truncated 5173 lines...]
[junit4:junit4] Suite: 
org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSourceTest
[junit4:junit4]   2> 7 Δεκ 2012 6:39:53 πμ 
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
 uncaughtException
[junit4:junit4]   2> WARNING: Uncaught exception in thread: 
Thread[Thread-2,5,TGRP-EnwikiContentSourceTest]
[junit4:junit4]   2> java.lang.NullPointerException
[junit4:junit4]   2>at 
__randomizedtesting.SeedInfo.seed([EF7AF10441351C3B]:0)
[junit4:junit4]   2>at java.io.Reader.<init>(Reader.java:61)
[junit4:junit4]   2>at 
java.io.InputStreamReader.<init>(InputStreamReader.java:112)
[junit4:junit4]   2>at 
org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:186)
[junit4:junit4]   2>at java.lang.Thread.run(Thread.java:662)
[junit4:junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=EnwikiContentSourceTest -Dtests.method=testForever 
-Dtests.seed=EF7AF10441351C3B -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=el -Dtests.timezone=SST -Dtests.file.encoding=UTF-8
[junit4:junit4] ERROR   0.07s J1 | EnwikiContentSourceTest.testForever <<<
[junit4:junit4]> Throwable #1: 
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=140, name=Thread-2, state=RUNNABLE, 
group=TGRP-EnwikiContentSourceTest]
[junit4:junit4]>at 
__randomizedtesting.SeedInfo.seed([EF7AF10441351C3B:AB004FFFCF2C6B8C]:0)
[junit4:junit4]> Caused by: java.lang.NullPointerException
[junit4:junit4]>at 
__randomizedtesting.SeedInfo.seed([EF7AF10441351C3B]:0)
[junit4:junit4]>at java.io.Reader.<init>(Reader.java:61)
[junit4:junit4]>at 
java.io.InputStreamReader.<init>(InputStreamReader.java:112)
[junit4:junit4]>at 
org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:186)
[junit4:junit4]>at java.lang.Thread.run(Thread.java:662)
[junit4:junit4]   2> NOTE: test params are: codec=Lucene41: {}, 
sim=DefaultSimilarity, locale=el, timezone=SST
[junit4:junit4]   2> NOTE: Linux 3.2.0-34-generic i386/Sun Microsystems Inc. 
1.6.0_37 (32-bit)/cpus=8,threads=1,free=47084536,total=64946176
[junit4:junit4]   2> NOTE: All tests run in this JVM: [TrecContentSourceTest, 
TestConfig, DocMakerTest, SearchWithSortTaskTest, StreamUtilsTest, 
WriteLineDocTaskTest, CreateIndexTaskTest, TestQualityRun, LineDocSourceTest, 
TestPerfTasksParse, AddIndexesTaskTest, PerfTaskTest, AltPackageTaskTest, 
EnwikiContentSourceTest]
[junit4:junit4] Completed on J1 in 0.30s, 3 tests, 1 error <<< FAILURES!
{noformat}

> EnwikiContentSource thread safety problem (NPE) in 'forever' mode
> -
>
> Key: LUCENE-4595
> URL: https://issues.apache.org/jira/browse/LUCENE-4595
> Project: Lucene - Core
>      Issue Type: Bug
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
>
> If close() is invoked around when an additional input stream reader is 
> recreated for the 'forever' behavior, an uncaught NPE might occur.
> This bug was probably always there, just exposed now with the 
> EnwikiContentSourceTest added in LUCENE-4588.

--
This message is automatically generated by JIRA.

[jira] [Created] (LUCENE-4595) EnwikiContentSource thread safety problem (NPE) in 'forever' mode

2012-12-06 Thread Doron Cohen (JIRA)
Doron Cohen created LUCENE-4595:
---

 Summary: EnwikiContentSource thread safety problem (NPE) in 
'forever' mode
 Key: LUCENE-4595
 URL: https://issues.apache.org/jira/browse/LUCENE-4595
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor


If close() is invoked around when an additional input stream reader is 
recreated for the 'forever' behavior, an uncaught NPE might occur.
This bug was probably always there, just exposed now with the 
EnwikiContentSourceTest added in LUCENE-4588.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1417939 - /lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/EnwikiContentSourceTest.java

2012-12-06 Thread Doron Cohen
Thanks Uwe.


On Thu, Dec 6, 2012 at 5:23 PM,  wrote:

> Author: uschindler
> Date: Thu Dec  6 15:23:13 2012
> New Revision: 1417939
>
> URL: http://svn.apache.org/viewvc?rev=1417939&view=rev
> Log:
> fix eol-style
>
> Modified:
>
> lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/EnwikiContentSourceTest.java
>   (props changed)
>
>


Re: svn commit: r1417871 - /lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/EnwikiContentSourceTest.java

2012-12-06 Thread Doron Cohen
... no idea ... :)
done.
thanks Uwe


On Thu, Dec 6, 2012 at 4:03 PM, Uwe Schindler  wrote:

> Why not simply use:
>
> new ByteArrayInputStream(docs.getBytes(IOUtils.CHARSET_UTF_8))
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: dor...@apache.org [mailto:dor...@apache.org]
> > Sent: Thursday, December 06, 2012 2:34 PM
> > To: comm...@lucene.apache.org
> > Subject: svn commit: r1417871 -
> > /lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchm
> > ark/byTask/feeds/EnwikiContentSourceTest.java
> >
> > Author: doronc
> > Date: Thu Dec  6 13:33:34 2012
> > New Revision: 1417871
> >
> > URL: http://svn.apache.org/viewvc?rev=1417871&view=rev
> > Log:
> > LUCENE-4588 (cont): (EnwikiContentSource fixes) avoid using the forbidden
> > StringBufferInputStream..
> >
> > Modified:
> >
> > lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchma
> > rk/byTask/feeds/EnwikiContentSourceTest.java
> >
> > Modified:
> > lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchma
> > rk/byTask/feeds/EnwikiContentSourceTest.java
> > URL:
> > http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/benchmark/src/tes
> > t/org/apache/lucene/benchmark/byTask/feeds/EnwikiContentSourceTest.ja
> > va?rev=1417871&r1=1417870&r2=1417871&view=diff
> > ==
> > 
> > ---
> > lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchma
> > rk/byTask/feeds/EnwikiContentSourceTest.java (original)
> > +++
> > lucene/dev/trunk/lucene/benchmark/src/test/org/apache/lucene/benchma
> > +++ rk/byTask/feeds/EnwikiContentSourceTest.java Thu Dec  6 13:33:34
> > +++ 2012
> > @@ -17,14 +17,17 @@ package org.apache.lucene.benchmark.byTa
> >   * limitations under the License.
> >   */
> >
> > +import java.io.ByteArrayInputStream;
> > +import java.io.ByteArrayOutputStream;
> >  import java.io.IOException;
> >  import java.io.InputStream;
> > +import java.io.OutputStreamWriter;
> >  import java.text.ParseException;
> >  import java.util.Properties;
> >
> >  import org.apache.lucene.benchmark.byTask.utils.Config;
> > +import org.apache.lucene.util.IOUtils;
> >  import org.apache.lucene.util.LuceneTestCase;
> > -import org.junit.Ignore;
> >  import org.junit.Test;
> >
> >  public class EnwikiContentSourceTest extends LuceneTestCase { @@ -38,10
> > +41,16 @@ public class EnwikiContentSourceTest ext
> >this.docs = docs;
> >  }
> >
> > -@SuppressWarnings("deprecation") // fine for the characters used in
> this
> > test
> >  @Override
> >  protected InputStream openInputStream() throws IOException {
> > -  return new java.io.StringBufferInputStream(docs);
> > +  // StringBufferInputStream would have been handy, but it is
> forbidden
> > +  ByteArrayOutputStream baos = new ByteArrayOutputStream();
> > +  OutputStreamWriter w = new OutputStreamWriter(baos,
> > IOUtils.CHARSET_UTF_8);
> > +  w.write(docs);
> > +  w.close();
> > +  byte[] byteArray = baos.toByteArray();
> > +  baos.close();
> > +  return new ByteArrayInputStream(byteArray);
> >  }
> >
> >}
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
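
The replacement Uwe suggests above (encode the string to UTF-8 bytes, then wrap the bytes in a ByteArrayInputStream) can be sketched as a small standalone example. The class and method names here are illustrative, not the actual test code, and StandardCharsets.UTF_8 stands in for Lucene's IOUtils.CHARSET_UTF_8:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StringStreamDemo {
    // Build an InputStream over a String without the forbidden
    // StringBufferInputStream: encode the string to UTF-8 bytes first.
    static InputStream openInputStream(String docs) {
        return new ByteArrayInputStream(docs.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        InputStream in = openInputStream("héllo");
        byte[] bytes = in.readAllBytes();
        // "héllo" encodes to 6 bytes in UTF-8 (the 'é' takes two bytes)
        System.out.println(bytes.length); // prints 6
    }
}
```

Unlike StringBufferInputStream, which silently truncated every character to its low 8 bits, this round-trips arbitrary Unicode text correctly.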


Re: [JENKINS] Lucene-Solr-trunk-Windows (64bit/jdk1.6.0_37) - Build # 2074 - Still Failing!

2012-12-06 Thread Doron Cohen
Fixed, sorry about that.


On Thu, Dec 6, 2012 at 3:28 PM, Policeman Jenkins Server <
jenk...@sd-datasolutions.de> wrote:

> Build:
> http://jenkins.sd-datasolutions.de/job/Lucene-Solr-trunk-Windows/2074/
> Java: 64bit/jdk1.6.0_37 -XX:+UseSerialGC
>
> All tests passed
>
> Build Log:
> [...truncated 19501 lines...]
> -check-forbidden-jdk-apis:
> [forbidden-apis] Reading API signatures:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\tools\forbiddenApis\executors.txt
> [forbidden-apis] Reading API signatures:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\tools\forbiddenApis\jdk-deprecated.txt
> [forbidden-apis] Reading API signatures:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\tools\forbiddenApis\jdk.txt
> [forbidden-apis] Loading classes to check...
> [forbidden-apis] Scanning for API signatures and dependencies...
> [forbidden-apis] Forbidden class use: java.io.StringBufferInputStream
> [forbidden-apis]   in
> org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSourceTest$StringableEnwikiSource
> (EnwikiContentSourceTest.java:44)
> [forbidden-apis] Scanned 5448 (and 420 related) class file(s) for
> forbidden API invocations (in 9.38s), 1 error(s).
>
> BUILD FAILED
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:67:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\build.xml:163:
> Check for forbidden API calls failed, see log.
>
> Total time: 53 minutes 0 seconds
> Build step 'Invoke Ant' marked build as failure
> Archiving artifacts
> Recording test results
> Description set: Java: 64bit/jdk1.6.0_37 -XX:+UseSerialGC
> Email was triggered for: Failure
> Sending email for trigger: Failure
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>


Re: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.7.0_09) - Build # 2073 - Still Failing!

2012-12-06 Thread Doron Cohen
That's because of my test code, using StringBufferInputStream.
Seems ok for that particular test, very convenient at that point...
I'll replace it.
Doron

On Thu, Dec 6, 2012 at 1:27 PM, Policeman Jenkins Server <
jenk...@sd-datasolutions.de> wrote:

> Build:
> http://jenkins.sd-datasolutions.de/job/Lucene-Solr-trunk-Windows/2073/
> Java: 32bit/jdk1.7.0_09 -client -XX:+UseG1GC
>
> All tests passed
>
> Build Log:
> [...truncated 20187 lines...]
> -check-forbidden-jdk-apis:
> [forbidden-apis] Reading API signatures:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\tools\forbiddenApis\executors.txt
> [forbidden-apis] Reading API signatures:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\tools\forbiddenApis\jdk-deprecated.txt
> [forbidden-apis] Reading API signatures:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\tools\forbiddenApis\jdk.txt
> [forbidden-apis] Loading classes to check...
> [forbidden-apis] Scanning for API signatures and dependencies...
> [forbidden-apis] Forbidden class use: java.io.StringBufferInputStream
> [forbidden-apis]   in
> org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSourceTest$StringableEnwikiSource
> (EnwikiContentSourceTest.java:44)
> [forbidden-apis] Scanned 5448 (and 423 related) class file(s) for
> forbidden API invocations (in 19.96s), 1 error(s).
>
> BUILD FAILED
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:67:
> The following error occurred while executing this line:
> C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\lucene\build.xml:163:
> Check for forbidden API calls failed, see log.
>
> Total time: 51 minutes 40 seconds
> Build step 'Invoke Ant' marked build as failure
> Archiving artifacts
> Recording test results
> Description set: Java: 32bit/jdk1.7.0_09 -client -XX:+UseG1GC
> Email was triggered for: Failure
> Sending email for trigger: Failure
>
>


[jira] [Commented] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511262#comment-13511262
 ] 

Doron Cohen commented on LUCENE-4590:
-

bq. Do you think perhaps that EnwikiContentSource should let the caller know 
whether the returned DocData represents a content page or category page?

That's what I planned at the start, but decided to leave WriteLineDoc intact 
because it is general, that is, not aware of the unique structure of Wikipedia 
data, where some of the pages represent categories.

bq. So maybe, if someone wants to generate a line file from the pages only... 
flexibility that I think you are trying to achieve...

Actually I am after the two files... :) These category pages are (unique) 
taxonomy node names, but without the taxonomy structure, which can be deduced 
from the (parent) categories of the category pages. Having these separate 
category pages can be useful for deducing that taxonomy.

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.




[jira] [Updated] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-4590:


Component/s: modules/benchmark

> WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file
> ---
>
> Key: LUCENE-4590
> URL: https://issues.apache.org/jira/browse/LUCENE-4590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
>Priority: Minor
>
> It may be convenient to split Wikipedia's line file into two separate files: 
> category-pages and non-category ones. 
> It is possible to split the original line file with grep or such.
> It is more efficient to do it in advance.




[jira] [Created] (LUCENE-4590) WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

2012-12-06 Thread Doron Cohen (JIRA)
Doron Cohen created LUCENE-4590:
---

 Summary: WriteEnwikiLineDoc which writes Wikipedia category pages 
to a separate file
 Key: LUCENE-4590
 URL: https://issues.apache.org/jira/browse/LUCENE-4590
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor


It may be convenient to split Wikipedia's line file into two separate files: 
category-pages and non-category ones. 
It is possible to split the original line file with grep or such.
It is more efficient to do it in advance.




[jira] [Assigned] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reassigned LUCENE-4588:
---

Assignee: Doron Cohen

> EnwikiContentSource silently swallows the last wiki doc
> ---
>
> Key: LUCENE-4588
> URL: https://issues.apache.org/jira/browse/LUCENE-4588
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4588.patch
>
>
> Last wiki doc is never returned




[jira] [Updated] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-05 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-4588:


Attachment: LUCENE-4588.patch

Patch adds a test for enwiki-content-source and fixes both the last doc problem 
and the thread leak.

> EnwikiContentSource silently swallows the last wiki doc
> ---
>
> Key: LUCENE-4588
> URL: https://issues.apache.org/jira/browse/LUCENE-4588
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>    Reporter: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-4588.patch
>
>
> Last wiki doc is never returned




[jira] [Commented] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-05 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510774#comment-13510774
 ] 

Doron Cohen commented on LUCENE-4588:
-

In addition, there's a thread leak in 'forever' mode.

> EnwikiContentSource silently swallows the last wiki doc
> ---
>
> Key: LUCENE-4588
> URL: https://issues.apache.org/jira/browse/LUCENE-4588
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Priority: Minor
>
> Last wiki doc is never returned




[jira] [Created] (LUCENE-4588) EnwikiContentSource silently swallows the last wiki doc

2012-12-05 Thread Doron Cohen (JIRA)
Doron Cohen created LUCENE-4588:
---

 Summary: EnwikiContentSource silently swallows the last wiki doc
 Key: LUCENE-4588
 URL: https://issues.apache.org/jira/browse/LUCENE-4588
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/benchmark
Reporter: Doron Cohen
Priority: Minor


Last wiki doc is never returned




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-09 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226623#comment-13226623
 ] 

Doron Cohen commented on LUCENE-3821:
-

Committed:
- r1299077  3x
- r1299112  trunk

bq. I would be glad to try out a nightly build with the patch as is against our 
tests - I would be glad to get the 80% solution if it fixes my bug.

It's in now...

bq.  But I wonder if we can re-use even some of the math to redefine the 
problem more formally to figure out what minimal state/lookahead we need for 
example...

Robert, this gave me an idea... currently, in case of "collision" between 
repeaters, we compare them and advance the "lesser" of them 
(SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be 
fairly easy to add lookahead to this logic: if one of the two is multi-term, 
lesser can also do a lookahead. The amount of lookahead can depend on the slop. 
I'll give it a try before closing this issue.


> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.5, 4.0
>    Reporter: Naomi Dushay
>Assignee: Doron Cohen
> Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-09 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226494#comment-13226494
 ] 

Doron Cohen commented on LUCENE-3821:
-

{quote}
Not understanding really how SloppyPhraseScorer works now, but not trying to 
add confusion to the issue, I can't help but think this problem is a variant on 
LevensteinAutomata... in fact that was the motivation for the new test, i just 
stole the testing methodology from there and applied it to this!
{quote}

Interesting! I was not aware of this. I went and read a bit about this 
automaton; it is relevant.

{quote}
It seems many things are the same but with a few twists:

* fundamentally we are interleaving the streams from the subscorers into the 
'index automaton'
* the 'query automaton' is produced from the user-supplied terms
{quote}

True. In fact, the current code works hard to decide on the "correct 
interleaving order" - while if we had a "Perfect Levenstein Automaton" that 
took care of the computation we could just interleave, in the term position 
order (forget about phrase position and all that) and let the automaton compute 
the distance. 

This might capture the difficulty in making the sloppy phrase scorer correct: 
it started with the algorithm that was optimized for exact matching, and 
adapted (hacked?) it for approximate matching.

Instead, starting with a model that fits approximate matching, might be easier 
and cleaner. I like that. 

{quote}
* our 'alphabet' is the terms, and holes from position increment are just an 
additional symbol.
* just like the LevensteinAutomata case, repeats are problematic because they 
are different characteristic vectors
* stacked terms at the same position (index or query) just make the automata 
more complex (so they arent just strings)

I'm not suggesting we try to re-use any of that code at all, i don't think it 
will work. But I wonder if we can re-use even
some of the math to redefine the problem more formally to figure out what 
minimal state/lookahead we need for example...
{quote}

I agree. I'll think of this.

In the meantime I'll commit this partial fix.

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.5, 4.0
>Reporter: Naomi Dushay
>Assignee: Doron Cohen
> Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-06 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700
 ] 

Doron Cohen commented on LUCENE-3821:
-

I'm afraid it won't solve the problem.

The complexity of SloppyPhraseScorer stems first from the slop.
That part has been handled in the scorer for a long time.

Two additional complications are repeating terms and multi-term phrases.
Each of these, separately, is handled as well.
Their combination, however, is the cause of this discussion.

To prevent two repeating terms from landing on the same document position, we 
propagate the smaller of them (smaller in its phrase-position, which takes into 
account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document 
"a b", because both "a"'s (query terms) will land on the same document's "a". 
This is illegal and is prevented by such propagation. 

But when one of the repeating terms is a multi-term, it is not possible to know 
which of the repeating terms to propagate. This is the unsolved bug.

Now, back to current ExactPhraseScorer.
It does not have this problem with repeating terms.
But not because of the different algorithm - rather because of the different 
scenario.
It does not have this problem because exact phrase scoring does not have it.
In exact phrase scoring, a match is declared only when all PPs are in the same 
phrase position.
Recall that phrase position = doc-position - query-offset; it follows that 
when two PPs with different query offsets are in the same phrase position, 
their doc-positions cannot be the same, and therefore no special handling is 
needed for repeating terms in exact phrase scoring.

However, once we add that sloppy-decaying frequency, we will match, at a 
certain posIndex, different phrase positions. This is because of the slop. So 
they might land on the same doc-position, and then we start again...

This is really too bad. Sorry for the lengthy post, hopefully this would help 
when someone wants to get into this.

Back to option 2.
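
As a purely illustrative sketch of the phrase-position arithmetic described above (not the actual scorer code; the class and method names are invented for the example), the "a b a"~2 vs. "a b" collision can be reproduced in a few lines:

```java
public class PhrasePositionDemo {
    // As described above: phrase position = doc-position - query-offset.
    static int phrasePosition(int docPosition, int queryOffset) {
        return docPosition - queryOffset;
    }

    public static void main(String[] args) {
        // Query "a b a"~2 has query offsets: a=0, b=1, a=2.
        // Document "a b" has doc positions: a=0, b=1.
        // Without special handling, BOTH query "a"s can land on the
        // document's single "a" at doc-position 0:
        int firstA  = phrasePosition(0, 0);  // phrase position 0
        int secondA = phrasePosition(0, 2);  // phrase position -2
        int b       = phrasePosition(1, 1);  // phrase position 0
        // The spread of phrase positions (max - min) is 2, i.e. within
        // slop 2, so the scorer would declare an (illegal) match of
        // "a b a"~2 on "a b" unless one repeating term is propagated.
        System.out.println(firstA + " " + secondA + " " + b); // prints: 0 -2 0
    }
}
```

This is why the scorer must propagate one of the repeating terms off the shared doc-position, and why it cannot know which one to propagate when a repeat is also a multi-term.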

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
>  Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>Reporter: Naomi Dushay
>Assignee: Doron Cohen
> Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-06 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223629#comment-13223629
 ] 

Doron Cohen commented on LUCENE-3821:
-

bq. sounds interesting: ExactPhraseScorer really has a lot of useful recent 
heuristics and optimizations, especially about when to next() versus advance() 
and such?

next()/advance() will remain, but it would still be more costly than exact 
matching - the score cache won't apply, because freqs really are floats in this 
case, and there would also be more computation along the way. But let's see it 
working first...

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.5, 4.0
>Reporter: Naomi Dushay
>Assignee: Doron Cohen
> Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x




[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-06 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3821:


Attachment: LUCENE-3821-SloppyDecays.patch

Patch adds NonExactPhraseScorer (temporary name) as discussed above - work in 
progress, it does not yet do any sloppy matching or scoring.

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.5, 4.0
>Reporter: Naomi Dushay
>    Assignee: Doron Cohen
> Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-06 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223554#comment-13223554
 ] 

Doron Cohen commented on LUCENE-3821:
-

OK great! 

Since you did not point out a problem with this up front, there's a good chance it will 
work, and I'd like to give it a try. 

I have a first patch - not working or anything - it opens ExactPhraseScorer a 
bit for extensions and adds a class (temporary name) - NonExactPhraseScorer. 

The idea is to hide inside the ChunkState the details of decaying frequencies due 
to slop. I will try it over the weekend. If we can make it work this way, I'd 
rather do it in this issue than commit the other new code for the 
fix and then replace it. If that doesn't go quickly, I'll commit the (other) 
changes to SloppyPhraseScorer and start a new issue.




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-06 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077
 ] 

Doron Cohen commented on LUCENE-3821:
-

Thanks Robert, okay, I'll continue with option 2 then.

In addition, perhaps we should try harder to build a sloppy version of the current 
ExactPhraseScorer, for both performance and correctness reasons. 

In ExactPhraseScorer, the increment of count[posIndex] is by 1, each time the 
conditions for a match (still) hold.  

A sloppy version of this, with N terms and slop=S, could increment differently:
{noformat}
1 + N*S      at posIndex
1 + N*S - 1  at posIndex-1 and posIndex+1
1 + N*S - 2  at posIndex-2 and posIndex+2
...
1 + N*S - S  at posIndex-S and posIndex+S
{noformat}

For S=0, this falls back to only increment by 1 and only at posIndex, same as 
the ExactPhraseScorer, which makes sense.

Also, the success criterion in ExactPhraseScorer, when checking term k, is, 
before adding up 1 for term k:
* count[posIndex] == k-1
Or, after adding up 1 for term k:
* count[posIndex] == k

In the sloppy version the criteria (after adding up term k) would be:
* count[posIndex] >= k*(1+N*S)-S

Again, for S=0 this falls to the ExactPhraseScorer logic:
* count[posIndex] >= k  

Mike (and all), correctness wise, what do you think?

If you are wondering why the increment at the actual position is (1 + N*S) - it 
allows matching only posIndexes where all terms contributed something.

I drew an example with 5 terms and slop=2 and so far it seems correct.

I also tried 2 terms and slop=5; this seems correct as well, except that, when 
there is a match, several posIndexes will contribute to the score of the same 
match. I think this is not too bad, since for these parameters the same behavior 
would occur in all documents. I would be especially forgiving of this if in this 
way we get some of the performance benefits of the ExactPhraseScorer.

If we agree on correctness, need to understand how to implement it, and 
consider the performance effect. The tricky part is to increment at posIndex-n. 
Say there are 3 terms in the query and one of the terms is found at indexes 10, 
15, 18. Assume the slop is 2. Since N=3, the max increment is:
- 1 + N*S = 1 + 3*2 = 7.

So the increments for this term would be (pos, incr):
{noformat}
Pos   Increment
---   -
 8  ,  5
 9  ,  6
10  ,  7
11  ,  6
12  ,  5
13  ,  5
14  ,  6
15  ,  7   =  max(7,5)
16  ,  6   =  max(6,5)
17  ,  6   =  max(5,6)
18  ,  7
19  ,  6
20  ,  5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we 
do not know yet about the contribution of posIndex 18, which is 6, and should 
be used instead of 5. So some look-ahead (or some fix-back) is required, which 
will complicate the code.

If this seems promising, should probably open a new issue for it, just wanted 
to get some feedback first.
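
To illustrate, here is a small standalone sketch (plain Java, not Lucene code; all names are made up for illustration) of the proposed increment rule. It reproduces the pos/increment table above for N=3 terms, slop S=2, and one term occurring at positions 10, 15, 18:

```java
import java.util.HashMap;
import java.util.Map;

public class SloppyIncrements {
  // A term occurring at position p contributes, at every position q with
  // |q - p| <= s, an increment of (1 + n*s) - |q - p|; where contributions
  // of the same term overlap, the maximum wins.
  static Map<Integer, Integer> increments(int[] termPositions, int n, int s) {
    int max = 1 + n * s;                 // increment at the exact position
    Map<Integer, Integer> incr = new HashMap<>();
    for (int p : termPositions) {
      for (int q = p - s; q <= p + s; q++) {
        int v = max - Math.abs(q - p);
        incr.merge(q, v, Integer::max);  // keep the larger contribution
      }
    }
    return incr;
  }

  public static void main(String[] args) {
    // the example from the discussion: N=3, S=2, one term at 10, 15, 18
    Map<Integer, Integer> incr = increments(new int[] {10, 15, 18}, 3, 2);
    for (int q = 8; q <= 20; q++) {
      System.out.println(q + " , " + incr.get(q));
    }
  }
}
```

For S=0 the rule degenerates to incrementing by 1 only at the exact position, matching the ExactPhraseScorer behavior.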




[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-05 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3821:


Attachment: LUCENE-3821.patch

Attached updated patch. 

Repeating PPs with a multi-phrase query are handled as well.

This called for more cases in the sloppy phrase scorer and more code, and, 
although I think the code is cleaner now, I don't know to what extent it is 
easier to maintain. 

It definitely fixes wrong behavior that exists in current 3x and trunk (patch 
is for 3x).

However, although the random test passes for me even with -Dtests.iter=2000, it 
is possible to "break the scorer" - that is, create a document and a query 
which should match each other but would not. 

The patch adds just such a case as an @Ignored test case:  
TestMultiPhraseQuery.testMultiSloppyWithRepeats(). 

I don't see how to solve this specific case in the context of current sloppy 
phrase scorer. 

So there are 3 options:
# leave things as they are
# commit this patch and for now document the failing scenario (also keep it in 
the ignored test case). 
# devise a different algorithm for this.

I would love it to be the 3rd if I only knew how to do it. Otherwise I like the 
2nd; we just need to keep in mind that the random test might occasionally 
create this scenario, so there will be noise in the test builds.

Preferences?




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-04 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879
 ] 

Doron Cohen commented on LUCENE-3821:
-

I think I understand the cause.

In the current implementation there is an assumption that once we have landed on the 
first candidate document, it is possible to check whether there are repeating pps 
just by comparing the in-doc-positions of the terms. 

Tricky as it is, while this is true for plain PhrasePositions, it is not true 
for MultiPhrasePositions - assume two MPPs, (a m n) and (b x y), and a first 
candidate document that starts with "a b". The in-doc-positions of the two pps 
would be 0 and 1 respectively (for 'a' and 'b'), so we would not even detect the 
fact that there are repetitions, let alone put them in the same group.

MPPs conflict with the current patch in an additional manner: it is now assumed 
that each repetition can be assigned to a repetition group. 

So assume these PPs (and query positions): 
0:a 1:b 3:a 4:b 7:c
There are clearly two repetition groups {0:a, 3:a} and {1:b, 4:b}, 
while 7:c is not a repetition.

But assume these PPs (and query positions): 
0:(a b) 1:(b x) 3:a 4:b 7:(c x)
We end up with a single large repetition group:
{0:(a b) 1:(b x) 3:a 4:b 7:(c x)}

I think if the groups are created correctly at the first candidate document, 
scorer logic would still work, as a collision is decided only when two pps are 
in the same in-doc-position. The only impact of MPPs would be performance cost: 
since repetition groups are larger, it would take longer to check if there are 
repetitions.

Just need to figure out how to detect repetition groups without relying on 
in-(first-)doc-positions.
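
One possible approach, sketched below as standalone code (not the actual patch; names are hypothetical), assumes the terms of each phrase position are known up front: treat two phrase positions as repetitions of each other whenever they share a term, and compute repetition groups as the connected components of that relation using union-find:

```java
import java.util.*;

public class RepetitionGroups {
  // union-find with path halving
  static int find(int[] parent, int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
  }

  // pps[i] holds the terms of phrase position i: one term for a plain PP,
  // several for a multi-phrase position.
  static Map<Integer, List<Integer>> groups(String[][] pps) {
    int[] parent = new int[pps.length];
    for (int i = 0; i < parent.length; i++) parent[i] = i;
    for (int i = 0; i < pps.length; i++) {
      for (int j = i + 1; j < pps.length; j++) {
        // union positions that share at least one term
        if (!Collections.disjoint(Arrays.asList(pps[i]), Arrays.asList(pps[j]))) {
          parent[find(parent, j)] = find(parent, i);
        }
      }
    }
    // collect connected components by root
    Map<Integer, List<Integer>> byRoot = new TreeMap<>();
    for (int i = 0; i < pps.length; i++) {
      byRoot.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
    }
    return byRoot;
  }

  public static void main(String[] args) {
    // the example from the comment: 0:(a b) 1:(b x) 3:a 4:b 7:(c x)
    String[][] pps = { {"a","b"}, {"b","x"}, {"a"}, {"b"}, {"c","x"} };
    System.out.println(groups(pps).values());
  }
}
```

For the MPP example above, 0:(a b) 1:(b x) 3:a 4:b 7:(c x) all land in a single group, while the plain example 0:a 1:b 3:a 4:b 7:c yields the groups {0,3}, {1,4}, {7}.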




[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-04 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3821:


Attachment: LUCENE-3821.patch

Updated patch with fixed MFQ.toString(); it prints the problematic doc and 
queries in case of failure.




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-04 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221867#comment-13221867
 ] 

Doron Cohen commented on LUCENE-3821:
-

Update: apparently MultiPhraseQuery.toString does not print its "holes".

So the query that failed was not:
{noformat}field:"(j o s) (i b j) (t d)"{noformat}

But rather:
{noformat}"(j o s) ? (i b j) ? ? (t d)"{noformat}

Which is a different story: this query should match the document
{noformat}s o b h j t j z o{noformat}

There is a match for ExactPhraseScorer, but not for Sloppy with slop 1.
So there is still work to do on SloppyPhraseScorer...

(I'll fix MFQ.toString() as well)
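
For anyone verifying by hand that the exact scorer should indeed match here: a standalone sketch (not using Lucene; names are made up) that scans the document for start positions where all three constrained query slots - offsets 0, 2, and 5, with the "?" holes unconstrained - are satisfied:

```java
import java.util.*;

public class HoleMatchCheck {
  // Returns the start positions in "s o b h j t j z o" where the query
  // "(j o s) ? (i b j) ? ? (t d)" matches exactly (slop 0).
  static List<Integer> matchPositions() {
    String[] doc = "s o b h j t j z o".split(" ");
    // query slot offset -> allowed terms; offsets 1, 3, 4 are holes
    Map<Integer, Set<String>> slots = new LinkedHashMap<>();
    slots.put(0, new HashSet<>(Arrays.asList("j", "o", "s")));
    slots.put(2, new HashSet<>(Arrays.asList("i", "b", "j")));
    slots.put(5, new HashSet<>(Arrays.asList("t", "d")));

    List<Integer> matches = new ArrayList<>();
    for (int start = 0; start + 5 < doc.length; start++) {
      boolean ok = true;
      for (Map.Entry<Integer, Set<String>> e : slots.entrySet()) {
        if (!e.getValue().contains(doc[start + e.getKey()])) { ok = false; break; }
      }
      if (ok) matches.add(start);
    }
    return matches;
  }

  public static void main(String[] args) {
    System.out.println(matchPositions());
  }
}
```

It reports a single exact match at document position 0 ("s", "b", "t"); since an exact match is also a sloppy match, the sloppy scorer with slop 1 should match this document too.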




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-04 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221840#comment-13221840
 ] 

Doron Cohen commented on LUCENE-3821:
-

The remaining failure still happens with the updated patch (same seed), and 
still seems to me like an ExactPhraseScorer bug. 

However, it is probably not a simple one, because when adding the case to 
TestMultiPhraseQuery it passes - that is, no documents are matched, as 
expected - although this supposedly recreates the exact scenario that failed 
above. 

Perhaps ExactPhraseScorer behavior too is influenced by other docs processed so 
far.

{code:title=Add this to TestMultiPhraseQuery}
  public void test_LUCENE_XYZ() throws Exception {
    Directory indexStore = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random, indexStore);
    add("s o b h j t j z o", "LUCENE-XYZ", writer);

    IndexReader reader = writer.getReader();
    IndexSearcher searcher = newSearcher(reader);

    MultiPhraseQuery q = new MultiPhraseQuery();
    q.add(new Term[] {new Term("body", "j"), new Term("body", "o"), new Term("body", "s")});
    q.add(new Term[] {new Term("body", "i"), new Term("body", "b"), new Term("body", "j")});
    q.add(new Term[] {new Term("body", "t"), new Term("body", "d")});
    assertEquals("Wrong number of hits", 0,
        searcher.search(q, null, 1).totalHits);

    // just make sure no exc:
    searcher.explain(q, 0);

    writer.close();
    searcher.close();
    reader.close();
    indexStore.close();
  }
{code}




[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-03 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3821:


Attachment: LUCENE-3821.patch

bq. Hmm patch has this: ... import backport.api...

Oops, here's a fixed patch; I also added some comments and removed the @Ignore 
from the test.

bq. I'm going to be ecstatic if that crazy test finds bugs both in exact and 
sloppy phrase scorers :)

It is a great test! Interestingly, one thing it exposed is the dependency of the 
SloppyPhraseScorer on the order of PPs in PhraseScorer when phraseFreq() is 
invoked. The way things work in the super class, that order depends on the 
content of previously processed documents. This fix removes that wrong 
dependency, of course. The point is that deliberately devising a test that 
exposes such a bug seems almost impossible: first you would need to think of such 
a case, and even if you did, a hand-written test that creates this specific 
scenario would likely be buggy itself. Praise to random testing, and to this 
random test in particular.




[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-03 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3821:


Attachment: LUCENE-3821.patch

Patch with fix for this problem. I would expect SloppyPhrase scoring 
performance to degrade, though I did not measure it.

The single test that still fails (and I think the bug is in ExactPhraseScorer) 
is testRandomIncreasingSloppiness, and can be recreated like this:
{noformat}
ant test -Dtestcase=TestSloppyPhraseQuery2 
-Dtestmethod=testRandomIncreasingSloppiness 
-Dtests.seed=47267613db69f714:-617bb800c4a3c645:-456a673444fdc184 
-Dargs="-Dfile.encoding=UTF-8"
{noformat}




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-03 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737
 ] 

Doron Cohen commented on LUCENE-3821:
-

I understand the problem. 

It all has to do - as Robert mentioned - with the repeating terms in the phrase 
query. I am working on a patch - it will change the way that repeats are 
handled. 

Repeating PPs require additional computation - the current SloppyPhraseScorer 
attempts to do that additional work efficiently, but oversimplifies and fails to 
cover all cases. 

At the core of things, each time a repeating PP is selected (from the queue) 
and propagated, *all* its sibling repeaters are propagated as well, to prevent 
a case where two repeating PPs point to the same document position (which was 
the bug that originally triggered handling repeats in this code). 

But this is wrong, because propagating all sibling repeaters misses some cases.

Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this 
comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]).

So this is a chance to also make the code more maintainable.

I have a working version which is not ready to commit yet, and all the tests 
pass, except for one which I think is a bug in ExactPhraseScorer, but maybe i 
am missing something. 

The case that fails is this:

{noformat}
AssertionError: Missing in super-set: doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 
shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 
shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 
shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 
shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 
shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document>
{noformat}

It seems that q1 too should not match this document?




[jira] [Assigned] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-03-01 Thread Doron Cohen (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reassigned LUCENE-3821:
---

Assignee: Doron Cohen




[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

2012-02-29 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219156#comment-13219156
 ] 

Doron Cohen commented on LUCENE-3821:
-

Fails here too like this: 

ant test -Dtestcase=TestSloppyPhraseQuery2 
-Dtestmethod=testRandomIncreasingSloppiness 
-Dtests.seed=-171bbb992c697625:203709d611c854a5:1ca48cb6d33b3f74 
-Dargs="-Dfile.encoding=UTF-8"

I'll look into it




[jira] [Resolved] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-06 Thread Doron Cohen (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-3746.
-

   Resolution: Fixed
 Assignee: Doron Cohen
Lucene Fields: Patch Available  (was: New)

Committed:
- r1241355 - trunk
- r1241363 - 3x

> suggest.fst.Sort.BufferSize should not automatically fail just because of 
> freeMemory()
> --
>
> Key: LUCENE-3746
> URL: https://issues.apache.org/jira/browse/LUCENE-3746
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/spellchecker
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3746.patch, LUCENE-3746.patch, LUCENE-3746.patch
>
>
> Follow up op dev thread: [FSTCompletionTest failure "At least 0.5MB RAM 
> buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




[jira] [Commented] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-05 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201073#comment-13201073
 ] 

Doron Cohen commented on LUCENE-3746:
-

Thanks Dawid! 

{quote}
it's probably a system daemon thread for sending memory threshold notifications
{quote}

Yes, this makes sense. 
Still, the difference between the two JDKs was bothering me.
After some more digging, I think it is now clear. 

Here are the stack traces reported (at the end of the test) with Oracle:
{noformat}
1.  Thread[ReaderThread,5,main]
2.  Thread[main,5,main]
3.  Thread[Reference Handler,10,system]
4.  Thread[Signal Dispatcher,9,system]
5.  Thread[Finalizer,8,system]
6.  Thread[Attach Listener,5,system]
{noformat}

And with IBM JDK:
{noformat}
1.  Thread[Attach API wait loop,10,main]
2.  Thread[Finalizer thread,5,system]
3.  Thread[JIT Compilation Thread,10,system]
4.  Thread[main,5,main]
5.  Thread[Gc Slave Thread,5,system]
6.  Thread[ReaderThread,5,main]
7.  Thread[Signal Dispatcher,5,main]
8.  Thread[MemoryPoolMXBean notification dispatcher,6,main]
{noformat}

The 8th thread is the one that started only after accessing the memory 
management layer. The thing is, that in the IBM JDK that thread is created in 
the ThreadGroup "main", while in the Oracle JDK it is created under "system". 
To me the latter makes more sense. 

To be more certain I added a fake memory notification listener and checked the 
thread in which the notification happens: 
{code}
MemoryMXBean mmxb = ManagementFactory.getMemoryMXBean();
NotificationListener listener = new NotificationListener() {
  @Override
  public void handleNotification(Notification notification, Object handback) {
System.out.println(Thread.currentThread());
  }
};
((NotificationEmitter) mmxb).addNotificationListener(listener, null, null);
{code}

Evidently, in the IBM JDK the notification runs in a "main"-group thread (also in line 
with the thread-group in the original warning message which triggered this 
threads discussion):
{noformat}
Thread[MemoryPoolMXBean notification dispatcher,6,main]
{noformat}

While in the Oracle JDK the notification runs in a "system"-group thread:
{noformat}
Thread[Low Memory Detector,9,system]
{noformat}

This also explains why the warning is reported only for the IBM JDK: the 
threads check in LTC only accounts for threads in the same thread-group as 
the one running the specific test case. So when dispatching happens in a 
"system"-group thread it is not sensed by that check at all.

OK, now with the mystery solved I can commit the simpler code...
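For illustration, the group-scoped check described above can be reproduced in a few lines. This is only a sketch of the idea (the class and method names are made up for this example, not the actual LuceneTestCase code):

```java
public class GroupThreads {
    // Lists only the threads in the current thread's group, mirroring the
    // scope of the leak check described above: threads living in other
    // groups (e.g. "system") never show up here.
    static Thread[] sameGroupThreads() {
        ThreadGroup tg = Thread.currentThread().getThreadGroup();
        // activeCount() is an estimate, so oversize the buffer a bit
        Thread[] buf = new Thread[tg.activeCount() * 2 + 1];
        int n = tg.enumerate(buf, false); // false: do not recurse into subgroups
        Thread[] out = new Thread[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    public static void main(String[] args) {
        for (Thread t : sameGroupThreads()) {
            System.out.println(t);
        }
    }
}
```

Run under the IBM JDK, the "MemoryPoolMXBean notification dispatcher" thread would appear in this listing (it lives in "main"), while under the Oracle JDK the "Low Memory Detector" thread would not (it lives in "system").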

> suggest.fst.Sort.BufferSize should not automatically fail just because of 
> freeMemory()
> --
>
> Key: LUCENE-3746
> URL: https://issues.apache.org/jira/browse/LUCENE-3746
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/spellchecker
>Reporter: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3746.patch, LUCENE-3746.patch, LUCENE-3746.patch
>
>
> Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM 
> buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




[jira] [Updated] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-05 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3746:


Attachment: LUCENE-3746.patch

Updated patch - without MemoryMXBean - computing 'max, total, free' (in that 
order) and deciding by 'free', or falling back to 'max-free'. This is more 
conservative than MemoryMXBean, but since the latter is not foolproof either, 
I prefer the simpler approach. 
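The decision rule described here can be sketched as follows. The class name, the 0.5MB floor, and the cap constant are illustrative stand-ins, not the committed Lucene code:

```java
public class BufferSizeSketch {
    static final long MB = 1024 * 1024;
    static final long ABSOLUTE_MIN = MB / 2;  // 0.5MB floor (illustrative)
    static final long DEFAULT_CAP = 32 * MB;  // illustrative cap

    // Computes 'max, total, free' in that order, decides by 'free', and
    // falls back to 'max - total' (the growth headroom) when 'free' alone
    // is too small -- mirroring the approach described above.
    static long automaticBytes() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();
        long total = rt.totalMemory();
        long free = rt.freeMemory();
        long half = free / 2;
        if (half >= ABSOLUTE_MIN) {
            return Math.min(DEFAULT_CAP, half);
        }
        // Free heap is nearly exhausted, but the -Xmx setting may still
        // allow the heap to grow: use the headroom instead of failing.
        return Math.min(DEFAULT_CAP, Math.max(max - total, ABSOLUTE_MIN));
    }

    public static void main(String[] args) {
        System.out.println(automaticBytes());
    }
}
```

This is exactly the case from the original report: under the IBM JDK, `free` can be tiny right after startup even though `-Xmx` allows the heap to grow much further, and the headroom fallback avoids the spurious failure.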

> suggest.fst.Sort.BufferSize should not automatically fail just because of 
> freeMemory()
> --
>
> Key: LUCENE-3746
> URL: https://issues.apache.org/jira/browse/LUCENE-3746
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/spellchecker
>Reporter: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3746.patch, LUCENE-3746.patch, LUCENE-3746.patch
>
>
> Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM 
> buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




[jira] [Updated] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-05 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3746:


Attachment: LUCENE-3746.patch

Updated patch using ManagementFactory.getMemoryMXBean().getHeapMemoryUsage(). 

Javadocs are not explicit about this call being atomic, but from the wording it 
seems safe to conclude that each call returns a new MemoryUsage instance. 
In this patch this is (Java) asserted, and the assert passes (-ea) in two 
different JVMs - IBM and Oracle - so this might be correct. I searched for more 
explicit info on this with no success. 

Annoyingly though, in IBM JDK, running the tests like this produces the nice 
warning:

{noformat}
WARNING: test class left thread running: Thread[MemoryPoolMXBean notification 
dispatcher,6,main]
RESOURCE LEAK: test class left 1 thread(s) running
{noformat}

This makes me reluctant to use the memory bean - I did not find a way to 
prevent that thread leak.

So perhaps a better approach would be to be conservative about the sequence of 
calls when using Runtime? Something like this:

{code}
long free = rt.freeMemory();
if (free is sufficient)
  return decideBy(free);
long max = rt.maxMemory();
long total = rt.totalMemory();
return decideBy(max - total);
{code}

This is conservative in that 'total' is computed last, and in that total-free 
is not added to the computed available bytes.

In both approaches, even if atomicity is guaranteed, it is possible that more 
heap is allocated in another thread between the time the size is computed 
and the time the bytes are actually allocated, so I am not sure how safe this 
check can be made.
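As a small, self-contained illustration of the heap-usage snapshot discussed in this comment (a sketch written under the same uncertainty about atomicity noted above, not the patch itself):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSnapshot {
    public static void main(String[] args) {
        MemoryMXBean mmxb = ManagementFactory.getMemoryMXBean();
        // Each call hands back an immutable MemoryUsage object; whether the
        // values inside it are captured atomically is exactly the open
        // question discussed above.
        MemoryUsage u = mmxb.getHeapMemoryUsage();
        // getMax() may be -1 (undefined); fall back to committed in that case.
        long ceiling = (u.getMax() < 0) ? u.getCommitted() : u.getMax();
        long headroom = ceiling - u.getUsed();
        System.out.println("used=" + u.getUsed()
            + " committed=" + u.getCommitted()
            + " max=" + u.getMax()
            + " headroom=" + headroom);
    }
}
```

Note that merely touching the memory management layer is what starts the "MemoryPoolMXBean notification dispatcher" thread mentioned above, which is why the simpler Runtime-based approach was committed instead.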

> suggest.fst.Sort.BufferSize should not automatically fail just because of 
> freeMemory()
> --
>
> Key: LUCENE-3746
> URL: https://issues.apache.org/jira/browse/LUCENE-3746
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/spellchecker
>Reporter: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3746.patch, LUCENE-3746.patch
>
>
> Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM 
> buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




[jira] [Commented] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-02 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199038#comment-13199038
 ] 

Doron Cohen commented on LUCENE-3746:
-

{quote}
[Dawid:|http://markmail.org/message/jobtemqm4u4vrxze] (maxMemory - totalMemory) 
because that's how much the heap can
grow? The problem is none of this is atomic, so the result can be
unpredictable. There are other methods in the management interface that
permit somewhat more detailed checks.  Don't know if they guarantee
atomicity of the returned snapshot, but I doubt it.
- 
[MemoryMXBean.getHeapMemoryUsage()|http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryMXBean.html#getHeapMemoryUsage()]
- 
[MemoryPoolMXBean.getPeakUsage()|http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryPoolMXBean.html#getPeakUsage()]
{quote}

The current patch does not (yet) handle the atomicity issue Dawid described. 

> suggest.fst.Sort.BufferSize should not automatically fail just because of 
> freeMemory()
> --
>
> Key: LUCENE-3746
> URL: https://issues.apache.org/jira/browse/LUCENE-3746
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/spellchecker
>Reporter: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3746.patch
>
>
> Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM 
> buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




[jira] [Updated] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-02 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3746:


Attachment: LUCENE-3746.patch

Simple fix: also consult maxMemory if freeMemory does not suffice.

> suggest.fst.Sort.BufferSize should not automatically fail just because of 
> freeMemory()
> --
>
> Key: LUCENE-3746
> URL: https://issues.apache.org/jira/browse/LUCENE-3746
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/spellchecker
>    Reporter: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3746.patch
>
>
> Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM 
> buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




Re: FSTCompletionTest failure "At least 0.5MB RAM buffer is needed"

2012-02-02 Thread Doron Cohen
Great, I thought so too - I just waited for other opinions and almost forgot
about it!
Created https://issues.apache.org/jira/browse/LUCENE-3746
Doron

On Thu, Feb 2, 2012 at 6:08 PM, Dawid Weiss wrote:

> long freeHeap = Runtime.getRuntime().freeMemory();
>
> Indeed, this doesn't look right; it'd have to be used in combination
> with (maxMemory - totalMemory) because that's how much the heap can
> grow? The problem is none of this is atomic, so the result can be
> unpredictable. There are other methods in the management interface that
> permit somewhat more detailed checks. Don't know if they guarantee
> atomicity of the returned snapshot, but I doubt it.
>
>
> http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryMXBean.html#getHeapMemoryUsage()
>
> http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryPoolMXBean.html#getPeakUsage()
>
> Dawid
>
> On Thu, Feb 2, 2012 at 4:58 PM, Robert Muir  wrote:
> > Doron, this sounds like something we should fix: I think we should
> > open a JIRA issue for it.
> >
> > On Mon, Jan 30, 2012 at 12:34 PM, Doron Cohen  wrote:
> >> Hi, this test consistently fails on Windows with an IBM JDK, with this
> >> error:
> >>
> >>> java.lang.IllegalArgumentException: At least 0.5MB RAM buffer is needed: 432472
> >>>  at org.apache.lucene.search.suggest.fst.Sort.<init>(Sort.java:159)
> >>>  at org.apache.lucene.search.suggest.fst.Sort.<init>(Sort.java:150)
> >>>  at org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:181)
> >>>  at org.apache.lucene.search.suggest.fst.FSTCompletionTest.testLargeInputConstantWeights(FSTCompletionTest.java:164)
> >>>
> >>> NOTE: reproduce with: ant test -Dtestcase=FSTCompletionTest
> >>> -Dtestmethod=testLargeInputConstantWeights
> >>> -Dtests.seed=63069fe8e90d25f1:-4459dd4f7ddf2b26:71f954eeb3888217 -Dargs="-Dfile.encoding=UTF-8"
> >>> NOTE: test params are: locale=et_EE, timezone=America/Argentina/La_Rioja
> >>> NOTE: all tests run in this JVM:
> >>> [FSTCompletionTest]
> >>> NOTE: Windows 7 6.1 build 7601 Service Pack 1 amd64/IBM Corporation 1.6.0
> >>> (64-bit)/cpus=2,threads=4,free=330928,total=6291456
> >>
> >> The memory provided to sort is computed in
> >> contrib/spell/.../suggest.fst.Sort.automatic():
> >>
> >> {code}
> >> public static BufferSize automatic() {
> >>   long freeHeap = Runtime.getRuntime().freeMemory();
> >>   return new BufferSize(Math.min(MIN_BUFFER_SIZE_MB * MB, freeHeap / 2));
> >> }
> >> {code}
> >>
> >> With Oracle's Java 6 the test passed.
> >>
> >> With the IBM JDK, the test fails even with -Xmx700m.   (Allow allocating at most
> >> 177M.)
> >> But it will pass with just -Xms10m.   (Allow allocating 10M at start.)
> >>
> >> So, if at a certain moment in a JVM's life the currently allocated memory is
> >> almost exhausted, Sort will fail, even if the settings in effect allow
> >> allocating more heap.
> >>
> >> It seems "nice" that Sort attempts to behave "nicely" - using at most half of
> >> the currently free heap.
> >> This makes sense.
> >> But perhaps, in the situation where there's not enough free memory but the max
> >> memory settings allow allocating more, a reasonable minimum should be
> >> returned, even the minimum of 0.5M. This will cause additional memory
> >> allocation for the heap, but I think in this case it is justified?
> >>
> >> Doron
> >
> >
> >
> > --
> > lucidimagination.com
> >
> >
>
>
>


[jira] [Created] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()

2012-02-02 Thread Doron Cohen (Created) (JIRA)
suggest.fst.Sort.BufferSize should not automatically fail just because of 
freeMemory()
--

 Key: LUCENE-3746
 URL: https://issues.apache.org/jira/browse/LUCENE-3746
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/spellchecker
Reporter: Doron Cohen
 Fix For: 3.6, 4.0


Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM buffer 
is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]




Re: [DISCUSS] New Website

2012-02-01 Thread Doron Cohen
On Wed, Feb 1, 2012 at 2:14 PM, Doron Cohen  wrote:

> HI Grant,
>
>> PS I guess it is me to blame not the new site - that somewhat grayed-out
>> text is harder to read - but there are many modern sites like this so will
>> have to learn to live with it.
>>
>>
>> Clarification please?
>>
>>
> I wished to make a too big deal of it...   the text I see (tried 3
> browsers) is a bit Gray (or faded), and I rather read a page with more
> contrast between the text and the background, more clear to my eyes this
> way, but this is just my personal preference, it seems others do not feel
> that way, so I'm okay with this.  Hopefully this is more clear now?
>

Sorry for the spam, I meant to write "I wished not to make a too big deal
of it... "


Re: [DISCUSS] New Website

2012-02-01 Thread Doron Cohen
>
> - Page icon (shown in the address bar) is that of Solr - should be that of
>> Lucene.
>>
>> ?
>>
>
> I meant this little icon: (hope it will show in the email to the list, if
> not I can send you privately).
>

That icon comes from here:
http://lucene.staging.apache.org/images/favicon.ico


Re: [DISCUSS] New Website

2012-02-01 Thread Doron Cohen
HI Grant,

> - Page icon (shown in the address bar) is that of Solr - should be that of
> Lucene.
>
> ?
>

I meant this little icon: (hope it will show in the email to the list, if
not I can send you privately).
[image: image.png]


> PS I guess it is me to blame not the new site - that somewhat grayed-out
> text is harder to read - but there are many modern sites like this so will
> have to learn to live with it.
>
>
> Clarification please?
>
>
I wished to make a too big deal of it...   the text I see (tried 3
browsers) is a bit Gray (or faded), and I rather read a page with more
contrast between the text and the background, more clear to my eyes this
way, but this is just my personal preference, it seems others do not feel
that way, so I'm okay with this.  Hopefully this is more clear now?

Doron

Re: [DISCUSS] New Website

2012-02-01 Thread Doron Cohen
Wow this is impressive!


> so I'd like to propose we make the leap and switch.
>

+ 1

> The new site is almost dead simple to edit:
> ...
> The whole process is 1000x easier than Forrest.  You can even edit via the
> web using a WYSWIG editor if you so desire.
>

This by itself is a major reason to move.

!!!*
> Now, here's the kicker, I'd like to switch to the new sites over this
> weekend or early next week at the latest and then iterate from there.  I
> don't have a solution for the Core versioned docs yet, but those can be
> figured out later.  Presumably we can convert them to markdown at some
> point before the next release and just maintain pointers to the old
> versions.
> !!!*
>

Like others said, we can handle this later.

Minor comments for now:
- The "Lucene" icon at top left still links to old site
- Page icon (shown in the address bar) is that of Solr - should be that of
Lucene.

Thanks for doing this!
Doron

PS I guess it is me to blame not the new site - that somewhat grayed-out
text is harder to read - but there are many modern sites like this so will
have to learn to live with it.


[jira] [Commented] (LUCENE-3737) Idea modules settings - verify and fix

2012-01-31 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197014#comment-13197014
 ] 

Doron Cohen commented on LUCENE-3737:
-

Yes, only saw this on trunk, thanks for taking care of this!

> Idea modules settings - verify and fix
> --
>
> Key: LUCENE-3737
> URL: https://issues.apache.org/jira/browse/LUCENE-3737
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 4.0
>    Reporter: Doron Cohen
>Assignee: Steven Rowe
>Priority: Trivial
> Fix For: 4.0
>
>
> Idea's settings for modules/queries and modules/queryparser refer to 
> lucene/contrib instead of modules.




[jira] [Commented] (LUCENE-3737) Idea modules settings - verify and fix

2012-01-31 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196858#comment-13196858
 ] 

Doron Cohen commented on LUCENE-3737:
-

In dev-tools/idea/.idea/ant.xml there are these two:

{code}


{code}

I assume this has the potential to break an Idea setup, but I haven't tried it 
yet; I just wanted to not forget about it, hence this issue. Is this a 
non-issue?


> Idea modules settings - verify and fix
> --
>
> Key: LUCENE-3737
> URL: https://issues.apache.org/jira/browse/LUCENE-3737
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Trivial
>
> Idea's settings for modules/queries and modules/queryparser refer to 
> lucene/contrib instead of modules.




[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-31 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196845#comment-13196845
 ] 

Doron Cohen commented on LUCENE-1812:
-

While merging to trunk I noticed that Idea's settings for modules/queries and 
modules/queryparser refer to lucene/contrib instead of modules. Seems trivial 
to fix, but I have no Idea installed at the moment, so no way to verify. Created 
LUCENE-3737 to handle that later.

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch, pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing 
> postings data for those terms where their in-document frequency is below a 
> specified threshold. The net effect of this processing is a much smaller 
> index that for common types of queries returns nearly identical top-N results 
> as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This 
> functionality is largely independent, so it can be used without term pruning 
> (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, 
> search performance increases, and recall decreases (i.e. search quality 
> deteriorates). NOTE: especially phrase recall deteriorates significantly at 
> higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit 
> completely in RAM, and store these indexes using 
> IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class 
> will not be sufficient to use the resulting index view for on-the-fly pruning 
> and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then 
> the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve 
> internal document id-s so that they are in sync with the original index. This 
> means that all other auxiliary information not necessary for first-tier 
> processing, such as some stored fields, can also be removed, to be quickly 
> retrieved on-demand from the original index using the same internal document 
> id. 
> Threshold values can be specified globally (for terms in all fields) using 
> defaultThreshold parameter, and can be overridden using per-field or per-term 
> values supplied in a thresholds map. Keys in this map are either field names, 
> or terms in field:text format. The precedence of these values is the 
> following: first a per-term threshold is used if present, then per-field 
> threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At this moment 
> it doesn't support all functionality available through API.




[jira] [Created] (LUCENE-3737) Idea modules settings - verify and fix

2012-01-31 Thread Doron Cohen (Created) (JIRA)
Idea modules settings - verify and fix
--

 Key: LUCENE-3737
 URL: https://issues.apache.org/jira/browse/LUCENE-3737
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Trivial


Idea's settings for modules/queries and modules/queryparser refer to 
lucene/contrib instead of modules.





[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-30 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196429#comment-13196429
 ] 

Doron Cohen commented on LUCENE-1812:
-

bq. Excellent, thanks for seeing this through!

Yeah, only more than a year delay ;)

BTW in trunk it will be under modules.

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch, pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing 
> postings data for those terms where their in-document frequency is below a 
> specified threshold. The net effect of this processing is a much smaller 
> index that for common types of queries returns nearly identical top-N results 
> as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This 
> functionality is largely independent, so it can be used without term pruning 
> (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, 
> search performance increases, and recall decreases (i.e. search quality 
> deteriorates). NOTE: especially phrase recall deteriorates significantly at 
> higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit 
> completely in RAM, and store these indexes using 
> IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class 
> will not be sufficient to use the resulting index view for on-the-fly pruning 
> and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then 
> the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve 
> internal document id-s so that they are in sync with the original index. This 
> means that all other auxiliary information not necessary for first-tier 
> processing, such as some stored fields, can also be removed, to be quickly 
> retrieved on-demand from the original index using the same internal document 
> id. 
> Threshold values can be specified globally (for terms in all fields) using 
> defaultThreshold parameter, and can be overridden using per-field or per-term 
> values supplied in a thresholds map. Keys in this map are either field names, 
> or terms in field:text format. The precedence of these values is the 
> following: first a per-term threshold is used if present, then per-field 
> threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At this moment 
> it doesn't support all functionality available through API.




[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-30 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196339#comment-13196339
 ] 

Doron Cohen commented on LUCENE-1812:
-

That dead code was removed and some javadocs added. 
Still room for more javadocs - e.g.  the static tool - and better test coverage.
Committed to 3x: r1237937.

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch, pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing 
> postings data for those terms where their in-document frequency is below a 
> specified threshold. The net effect of this processing is a much smaller 
> index that for common types of queries returns nearly identical top-N results 
> as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This 
> functionality is largely independent, so it can be used without term pruning 
> (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, 
> search performance increases, and recall decreases (i.e. search quality 
> deteriorates). NOTE: especially phrase recall deteriorates significantly at 
> higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit 
> completely in RAM, and store these indexes using 
> IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class 
> will not be sufficient to use the resulting index view for on-the-fly pruning 
> and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then 
> the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve 
> internal document id-s so that they are in sync with the original index. This 
> means that all other auxiliary information not necessary for first-tier 
> processing, such as some stored fields, can also be removed, to be quickly 
> retrieved on-demand from the original index using the same internal document 
> id. 
> Threshold values can be specified globally (for terms in all fields) using 
> defaultThreshold parameter, and can be overridden using per-field or per-term 
> values supplied in a thresholds map. Keys in this map are either field names, 
> or terms in field:text format. The precedence of these values is the 
> following: first a per-term threshold is used if present, then per-field 
> threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At this moment 
> it doesn't support all functionality available through API.
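
The threshold precedence described above (per-term, then per-field, then the
default) can be sketched as follows. This is an illustrative sketch only; the
class and method names are hypothetical, not the actual pruning module API:

```java
import java.util.Map;

// Hypothetical helper showing the documented lookup order: a "field:text"
// key wins over a "field" key, which wins over the default threshold.
final class ThresholdResolver {
    static int resolve(Map<String, Integer> thresholds, int defaultThreshold,
                       String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text);   // per-term key
        if (perTerm != null) return perTerm;
        Integer perField = thresholds.get(field);               // per-field key
        if (perField != null) return perField;
        return defaultThreshold;                                // fallback
    }
}
```

For example, with thresholds {"body": 3, "body:lucene": 5} and a default of 1,
the term body:lucene resolves to 5, any other body term to 3, and terms in
other fields to 1.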

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-30 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1812:


Attachment: pruning.patch

Updated patch: package.html and all pruning classes moved to another package, 
except for PruningReader. Now ant javadocs-all passes as well. There are three 
TODOs:
# implement CarmelTermPruningDeltaTopPolicy
# dead code question in CarmelUniformTermPruningPolicy
# missing details in package.html

The first one can wait but the other two I would like to handle before 
committing.

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch, pruning.patch
>




FSTCompletionTest failure "At least 0.5MB RAM buffer is needed"

2012-01-30 Thread Doron Cohen
Hi, this test consistently fails on Windows with an IBM JDK, with this
error:

> java.lang.IllegalArgumentException: At least 0.5MB RAM buffer is needed: 432472
>  at org.apache.lucene.search.suggest.fst.Sort.<init>(Sort.java:159)
>  at org.apache.lucene.search.suggest.fst.Sort.<init>(Sort.java:150)
>  at org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:181)
>  at org.apache.lucene.search.suggest.fst.FSTCompletionTest.testLargeInputConstantWeights(FSTCompletionTest.java:164)
>
> NOTE: reproduce with: ant test -Dtestcase=FSTCompletionTest
> -Dtestmethod=testLargeInputConstantWeights
> -Dtests.seed=63069fe8e90d25f1:-4459dd4f7ddf2b26:71f954eeb3888217
> -Dargs="-Dfile.encoding=UTF-8"
> NOTE: test params are: locale=et_EE, timezone=America/Argentina/La_Rioja
> NOTE: all tests run in this JVM:
> [FSTCompletionTest]
> NOTE: Windows 7 6.1 build 7601 Service Pack 1 amd64/IBM Corporation 1.6.0
(64-bit)/cpus=2,threads=4,free=330928,total=6291456

The memory provided to Sort is computed in
contrib/spell/.../suggest.fst.Sort.automatic():

{code}
public static BufferSize automatic() {
  long freeHeap = Runtime.getRuntime().freeMemory();
  return new BufferSize(Math.min(MIN_BUFFER_SIZE_MB * MB, freeHeap / 2));
}
{code}

With Oracle's Java 6 the test passed.

With the IBM JDK, the test fails even with -Xmx700m (which allows allocating
at most 177M).
But it will pass with just -Xms10m (which allocates 10M at start).

So, if at a certain moment in a JVM's life the currently allocated memory
is almost exhausted, Sort will fail, even if the settings in effect allow
more heap to be allocated.

It is nice that Sort attempts to behave politely and use at most half of
the currently free heap; this makes sense.
But when there is not enough free memory yet the max memory settings allow
allocating more, perhaps a reasonable minimum should be returned, even the
0.5M minimum. This would cause additional heap allocation, but I think in
this case it is justified?
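
A minimal sketch of the suggested fallback. The constants and the simplified
signature are assumptions for illustration (the real Sort.automatic() takes no
arguments and queries the runtime directly); this is not the actual Lucene code:

```java
// Sketch: when half of the free heap is below the 0.5MB floor but the max
// heap setting still allows the heap to grow, fall back to the floor instead
// of failing with "At least 0.5MB RAM buffer is needed".
final class BufferSizeHeuristic {
    static final long MB = 1024L * 1024L;
    static final long MIN_BUFFER_SIZE_MB = 32;  // assumed cap, as in the quoted code
    static final long ABSOLUTE_MIN = MB / 2;    // the 0.5MB floor from the error

    /** Buffer size in bytes, given the current free heap and the max heap limit. */
    static long automatic(long freeHeap, long maxHeap) {
        long half = freeHeap / 2;
        if (half < ABSOLUTE_MIN && maxHeap / 2 > ABSOLUTE_MIN) {
            return ABSOLUTE_MIN;  // heap can still grow: accept the minimum
        }
        return Math.min(MIN_BUFFER_SIZE_MB * MB, half);
    }
}
```

With the numbers from the failure above (freeHeap/2 = 432472 bytes, -Xmx700m),
this would return the 0.5MB minimum instead of throwing.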

Doron


[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-24 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192101#comment-13192101
 ] 

Doron Cohen commented on LUCENE-1812:
-

I ran 'javadocs' under 3x/lucene/contrib/pruning and 'javadocs-all' under 
3x/lucene. 

The latter failed due to multiple package.html under o.a.l.index - in core and 
under contrib/pruning. 

Renaming the package entirely to o.a.l.pruning.index won't work because 
PruningReader accesses the package-protected SegmentTermVector.

I can move the other classes to that new package and keep only PruningReader in 
that "index friend" package. (Unless there are javadoc/ant tricks that will 
avoid this error and still generate valid javadocs in both cases).


> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>    Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch
>




[jira] [Resolved] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-24 Thread Doron Cohen (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-3718.
-

   Resolution: Fixed
Fix Version/s: (was: 3.6)
Lucene Fields: Patch Available  (was: New)

Fixed.

> SamplingWrapperTest failure with certain test seed
> --
>
> Key: LUCENE-3718
> URL: https://issues.apache.org/jira/browse/LUCENE-3718
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
> Fix For: 4.0
>
> Attachments: LUCENE-3718.patch, LUCENE-3718.patch
>
>
> Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12231/
> 1 tests failed.
> REGRESSION:  
> org.apache.lucene.facet.search.SamplingWrapperTest.testCountUsingSamping
> Error Message:
> Results are not the same!
> Stack Trace:
> org.apache.lucene.facet.FacetTestBase$NotSameResultError: Results are not the 
> same!
>at 
> org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:333)
>at 
> org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:104)
>at 
> org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:82)
>at 
> org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529)
>at 
> org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
>at 
> org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
> NOTE: reproduce with: ant test -Dtestcase=SamplingWrapperTest 
> -Dtestmethod=testCountUsingSamping 
> -Dtests.seed=4a5994491f79fc80:-18509d134c89c159:-34f6ecbb32e930f7 
> -Dtests.multiplier=3 -Dargs="-Dfile.encoding=UTF-8"
> NOTE: test params are: codec=Lucene40: 
> {$facets=PostingsFormat(name=MockRandom), 
> $full_path$=PostingsFormat(name=MockSep), content=Pulsing40(freqCutoff=19 
> minBlockSize=65 maxBlockSize=209), 
> $payloads$=PostingsFormat(name=Lucene40WithOrds)}, 
> sim=RandomSimilarityProvider(queryNorm=true,coord=true): {$facets=LM 
> Jelinek-Mercer(0.70), content=DFR I(n)B3(800.0)}, locale=bg, 
> timezone=Asia/Manila




[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-24 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192046#comment-13192046
 ] 

Doron Cohen commented on LUCENE-3718:
-

Fix committed in r1235190 (trunk).
I did not add a CHANGES entry - it seems like overkill here... other opinions?


> SamplingWrapperTest failure with certain test seed
> --
>
> Key: LUCENE-3718
> URL: https://issues.apache.org/jira/browse/LUCENE-3718
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3718.patch, LUCENE-3718.patch
>
>




[jira] [Updated] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-24 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3718:


Attachment: LUCENE-3718.patch

Updated patch with the same fix also in AllDocsSegmentDocsEnum.linearScan() 
(the previous patch fixed only LiveDocsSegmentDocsEnum.linearScan()).

I also verified that this facets test does not fail in 3x with same seed.

> SamplingWrapperTest failure with certain test seed
> --
>
> Key: LUCENE-3718
> URL: https://issues.apache.org/jira/browse/LUCENE-3718
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3718.patch, LUCENE-3718.patch
>
>




[jira] [Updated] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-24 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3718:


Attachment: LUCENE-3718.patch

Attached a simple fix to Lucene40PostingsReader: linearScan() should also set 
doc when returning the result of refill().
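
A minimal illustration of the bug pattern, using hypothetical names rather than
the actual Lucene40PostingsReader code: when the scan delegates to refill(),
the cached doc field must be assigned too, otherwise callers observe a stale
doc id.

```java
// Sketch of a linear scan that caches the current doc id. The fix is on the
// refill path: assign 'doc' rather than returning refill()'s value directly.
final class ScanExample {
    private final int[] buffer;   // buffered doc ids
    private int pos = -1;
    int doc = -1;                 // cached current doc id, read by callers

    ScanExample(int[] docs) { this.buffer = docs; }

    int nextDoc() {
        pos++;
        if (pos >= buffer.length) {
            return doc = refill();   // fix: set doc, don't just return
        }
        return doc = buffer[pos];
    }

    private int refill() { return Integer.MAX_VALUE; } // NO_MORE_DOCS stand-in
}
```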

> SamplingWrapperTest failure with certain test seed
> --
>
> Key: LUCENE-3718
> URL: https://issues.apache.org/jira/browse/LUCENE-3718
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>    Reporter: Doron Cohen
>    Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3718.patch
>
>




[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-24 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192031#comment-13192031
 ] 

Doron Cohen commented on LUCENE-3718:
-

Well, this is not a test bug after all; it exposes a bug in 
Lucene40PostingsReader.

> SamplingWrapperTest failure with certain test seed
> --
>
> Key: LUCENE-3718
> URL: https://issues.apache.org/jira/browse/LUCENE-3718
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
>




Re: Problem: High Term Frequency After Search

2012-01-24 Thread Doron Cohen
The faceted search module may also be useful here - it has a programmatic API
and different capabilities and a different approach from the faceted search
solution in Solr. There was a problem in the javadoc in 3.5, so this
documentation is somewhat hidden (fixed for the future 3.6). See
http://lucene.apache.org/java/3_5_0/api/contrib-facet/index.html and note
the link on that page to the user guide.

Doron

On Wed, Jan 18, 2012 at 11:04 PM, colbacc8  wrote:

>
> colbacc8 wrote
> >
> > The solution is facet searchyou have reason.
> >
> >
> > Do you have any document for understand this type of search in lucene?
> >
>
> The facet search is the reason whi I have to group documents into 5
> categories. These 5 categories are dynamic and are created based on the 5
> most important keywords in the documents resulting from the research
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Problem-High-Term-Frequency-After-Search-tp3651188p3670475.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 12231 - Failure

2012-01-23 Thread Doron Cohen
Created https://issues.apache.org/jira/browse/LUCENE-3718 for this.



[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-23 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191963#comment-13191963
 ] 

Doron Cohen commented on LUCENE-3718:
-

The failure is consistently reproduced with these parameters.
It is most likely a test bug, but still annoying.
The misspelled method should also be renamed - it should be: testCountUsingSampling()

> SamplingWrapperTest failure with certain test seed
> --
>
> Key: LUCENE-3718
> URL: https://issues.apache.org/jira/browse/LUCENE-3718
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
>




[jira] [Created] (LUCENE-3718) SamplingWrapperTest failure with certain test seed

2012-01-23 Thread Doron Cohen (Created) (JIRA)
SamplingWrapperTest failure with certain test seed
--

 Key: LUCENE-3718
 URL: https://issues.apache.org/jira/browse/LUCENE-3718
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
 Fix For: 3.6, 4.0







[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-23 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191222#comment-13191222
 ] 

Doron Cohen commented on LUCENE-1812:
-

bq. I didn't test them, but I will once they have been committed.

Great, thanks! 

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing 
> postings data for those terms where their in-document frequency is below a 
> specified threshold. The net effect of this processing is a much smaller 
> index that for common types of queries returns nearly identical top-N results 
> as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This 
> functionality is largely independent, so it can be used without term pruning 
> (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, 
> search performance increases, and recall decreases (i.e. search quality 
> deteriorates). NOTE: especially phrase recall deteriorates significantly at 
> higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit 
> completely in RAM, and store these indexes using 
> IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class 
> will not be sufficient to use the resulting index view for on-the-fly pruning 
> and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then 
> the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve 
> internal document id-s so that they are in sync with the original index. This 
> means that all other auxiliary information not necessary for first-tier 
> processing, such as some stored fields, can also be removed, to be quickly 
> retrieved on-demand from the original index using the same internal document 
> id. 
> Threshold values can be specified globally (for terms in all fields) using 
> defaultThreshold parameter, and can be overridden using per-field or per-term 
> values supplied in a thresholds map. Keys in this map are either field names, 
> or terms in field:text format. The precedence of these values is the 
> following: first a per-term threshold is used if present, then per-field 
> threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At this moment 
> it doesn't support all functionality available through API.
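The threshold precedence described above (per-term in "field:text" format, then per-field, then the default) can be sketched in plain Java as follows; `ThresholdResolver` and its method names are illustrative stand-ins, not the actual PruningTool/PruningPolicy API:

```java
import java.util.HashMap;
import java.util.Map;

public class ThresholdResolver {
    private final Map<String, Integer> thresholds; // keys: "field" or "field:text"
    private final int defaultThreshold;

    public ThresholdResolver(Map<String, Integer> thresholds, int defaultThreshold) {
        this.thresholds = thresholds;
        this.defaultThreshold = defaultThreshold;
    }

    public int thresholdFor(String field, String text) {
        Integer t = thresholds.get(field + ":" + text); // per-term threshold first
        if (t == null) {
            t = thresholds.get(field);                  // then per-field
        }
        return t != null ? t : defaultThreshold;        // finally the default
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        m.put("content", 3);
        m.put("content:lucene", 5);
        ThresholdResolver r = new ThresholdResolver(m, 1);
        System.out.println(r.thresholdFor("content", "lucene")); // per-term wins: 5
    }
}
```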




[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-23 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191209#comment-13191209
 ] 

Doron Cohen commented on LUCENE-1812:
-

I now see that all other contrib components have svn:ignore for *.iml and 
pom.xml - I'll add that for pruning as well (though it is not in the attached 
patch).

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch
>
>




[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-23 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191206#comment-13191206
 ] 

Doron Cohen commented on LUCENE-1812:
-

Getting to this, at last. 

I did not handle the above TODOs; I'd rather commit now so they can be handled 
separately later ("progress not perfection", as Mike says). 

Changes in this patch: 
- PruningReader overrides also getSequentialSubReaders(), otherwise no pruning 
takes place on sub-readers (and tests fail). 
- StorePruningPolicy fixed to use FieldInfos API.

I modified the IDEA and Maven configurations by following the templates for 
other contrib components, but I have no way to test them and would appreciate a 
review.

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch
>
>




[jira] [Updated] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

2012-01-23 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1812:


Attachment: pruning.patch

Updated patch for current 3x.

> Static index pruning by in-document term frequency (Carmel pruning)
> ---
>
> Key: LUCENE-1812
> URL: https://issues.apache.org/jira/browse/LUCENE-1812
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Andrzej Bialecki 
>    Assignee: Doron Cohen
> Fix For: 3.6, 4.0
>
> Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch, pruning.patch
>
>




[jira] [Commented] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts

2012-01-19 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189036#comment-13189036
 ] 

Doron Cohen commented on LUCENE-3703:
-

Missed that test comment about no need for random directory.
About the decRef dup code, yeah, that's what I meant, but okay.
I think this is ready to commit.

> DirectoryTaxonomyReader.refresh misbehaves with ref counts
> --
>
> Key: LUCENE-3703
> URL: https://issues.apache.org/jira/browse/LUCENE-3703
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Shai Erera
>Assignee: Shai Erera
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3703.patch, LUCENE-3703.patch
>
>
> DirectoryTaxonomyReader uses the internal IndexReader in order to track its 
> own reference counting. However, when you call refresh(), it reopens the 
> internal IndexReader, and from that point, all previous reference counting 
> gets lost (since the new IndexReader's refCount is 1).
> The solution is to track reference counting in DTR itself. I wrote a simple 
> unit test which exposes the bug (will be attached with the patch shortly).




[jira] [Commented] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts

2012-01-18 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188975#comment-13188975
 ] 

Doron Cohen commented on LUCENE-3703:
-

Patch looks good, builds and passes for me, thanks for fixing this Shai.

Few comments:
* CHANGES: rephrase the e.g. part like this: (e.g. if application called 
incRef/decRef).
* New test:
** LTC.newDirectory() instead of new RAMDirectory().
** text messages in the asserts.
* DTR: 
** Would it be simpler to make close() synchronized (just like IR.close())
** Would it - again - be simpler to keep maintaining the ref-counts in the 
internal IR and just, in refresh, decRef as needed in the old one and incRef 
accordingly in the new one? This way we continue to delegate that logic to IR, 
and do not duplicate it.
** Current patch removes the ensureOpen() check from getRefCount(). I think 
this is correct - in fact I needed that when debugging this. Perhaps should 
document about it in CHANGES entry.
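The delegation idea in the second DTR bullet above can be sketched roughly like this; `Reader` and `TaxoReaderSketch` are simplified stand-ins for the internal IndexReader and DirectoryTaxonomyReader, not the real classes:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the internal IndexReader's ref counting.
class Reader {
    final AtomicInteger refCount = new AtomicInteger(1); // a freshly opened reader holds one ref
    void incRef() { refCount.incrementAndGet(); }
    void decRef() { refCount.decrementAndGet(); }
}

public class TaxoReaderSketch {
    Reader internal = new Reader();

    // On refresh, transfer the references callers accumulated on the old
    // reader to the reopened one, then fully release the old reader, so
    // ref-count logic stays delegated to the reader itself.
    void refresh() {
        Reader reopened = new Reader(); // reopen(): starts with refCount == 1
        int extra = internal.refCount.get() - 1;
        for (int i = 0; i < extra; i++) {
            reopened.incRef();          // carry over callers' extra refs
        }
        for (int i = 0; i <= extra; i++) {
            internal.decRef();          // drop all refs on the old reader
        }
        internal = reopened;
    }

    public static void main(String[] args) {
        TaxoReaderSketch t = new TaxoReaderSketch();
        t.internal.incRef(); // a caller takes a reference
        t.refresh();
        System.out.println(t.internal.refCount.get()); // 2: internal ref + the caller's ref
    }
}
```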

> DirectoryTaxonomyReader.refresh misbehaves with ref counts
> --
>
> Key: LUCENE-3703
> URL: https://issues.apache.org/jira/browse/LUCENE-3703
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Shai Erera
>Assignee: Shai Erera
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3703.patch
>
>




[jira] [Commented] (LUCENE-3635) Allow setting arbitrary objects on PerfRunData

2011-12-19 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172215#comment-13172215
 ] 

Doron Cohen commented on LUCENE-3635:
-

Patch looks good.

bq. I do not propose to move IR/IW/TR/TW etc. into that map. If however people 
think that we should, I can do that as well.

I rather keep these ones explicit as they are now.

bq. I wonder if we should have this Map require Closeable so that we can close 
the objects on PerfRunData.close()

Closing would be convenient, but I think requiring callers to pass a Closeable 
is too restrictive.
Instead, you could add something like this to close():

{code}
for (Object o : perfObjects.values()) {
  if (o instanceof Closeable) {
IOUtils.close((Closeable) o);
  }
}
{code}

This is done only once at the end, so "instanceof" is not a perf issue here.
If we close like this, we also need to document it at setPerfObject().

I think, BTW, that PFD.close() is not called by the Benchmark, it has to be 
explicitly invoked by the user.

> Allow setting arbitrary objects on PerfRunData
> --
>
> Key: LUCENE-3635
> URL: https://issues.apache.org/jira/browse/LUCENE-3635
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/benchmark
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3635.patch
>
>
> PerfRunData is used as the intermediary objects between PerfRunTasks. Just 
> like we can set IndexReader/Writer on it, it will be good if it allows 
> setting other arbitrary objects that are e.g. created by one task and used by 
> another.
> A recent example is the enhancement to the benchmark package following the 
> addition of the facet module. We had to add TaxoReader/Writer.
> The proposal is to add a HashMap that custom PerfTasks can 
> set()/get(). I do not propose to move IR/IW/TR/TW etc. into that map. If 
> however people think that we should, I can do that as well.




[jira] [Resolved] (LUCENE-2982) Get rid of ContenSource's workaround for closing b/gzip input stream once this is fixed in CommonCompress

2011-12-08 Thread Doron Cohen (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-2982.
-

Resolution: Duplicate

Fixed in LUCENE-3457.

> Get rid of ContenSource's workaround for closing b/gzip input stream once 
> this is fixed in CommonCompress
> -
>
> Key: LUCENE-2982
> URL: https://issues.apache.org/jira/browse/LUCENE-2982
> Project: Lucene - Java
>  Issue Type: Task
>  Components: modules/benchmark
>Reporter: Doron Cohen
>Priority: Minor
>
> Once COMPRESS-127 is fixed get rid of the entire workaround method 
> ContentSource.closableCompressorInputStream(). It would simplify the code and 
> would perform better without that delegation.




[jira] [Commented] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0

2011-11-28 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159086#comment-13159086
 ] 

Doron Cohen commented on LUCENE-3604:
-

bq. The new version will show up on the website once the periodic resync 
happens.

[3.5-contrib-changes|http://lucene.apache.org/java/3_5_0/changes/Contrib-Changes.html#3.5.0.api_changes]
 now shows the correct API changes. Thanks Steven!

> 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0
> -
>
> Key: LUCENE-3604
> URL: https://issues.apache.org/jira/browse/LUCENE-3604
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Doron Cohen
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.5
>
>
> There are two "API Changes" sections which is confusing when looking at the 
> txt version of the file. 
> The HTML expands only the first of the two, unless expand-all is clicked.




Re: 3x/contrib/CHANGES.txt

2011-11-28 Thread Doron Cohen
This was fixed in https://issues.apache.org/jira/browse/LUCENE-3604.

On Mon, Nov 28, 2011 at 10:33 AM, Doron Cohen  wrote:

> Adding an entry to 3x/contrib/CHANGES.txt in
> https://issues.apache.org/jira/browse/LUCENE-3596 I noticed that it had
> (No changes) at the top (in the section already prepared for 3.6) although
> there was already one bug fix listed below, so removed that assuming it is
> a leftover - just wanted to verify that this was not left there with a
> purpose I am not aware of?
>
> Also, in the 3.5 section, there are two "API Changes" sub-sections.
> I checked, and it is the same also in the
> http://lucene.apache.org/java/3_5_0/changes/Contrib-Changes.html
> Mmm... in fact this is a serious problem, because of the way the
> expand/collapse works in Changes.html: the only way to actually see these
> changes is to click "Expand All".
>
> So fixing this in 3x is easy, I am not sure about the correct way to fix
> this in the Web site. I'll dig.
>


[jira] [Assigned] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0

2011-11-28 Thread Doron Cohen (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reassigned LUCENE-3604:
---

Assignee: (was: Doron Cohen)

I looked at "HowToUpdateTheWebSite" but still do not understand how to update 
the contrib/Changes.html that appears in the main Web site. 

Is it a 'versioned' or 'unversioned' that should be changed? 

I think it is the 'versioned' one, but then I don't understand the instructions 
in that wiki page: an example with 3.4.1 requires an 'svn co' of trunk; is this 
real?

So better if this is handled by someone who knows his way around this process.

> 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0
> -
>
> Key: LUCENE-3604
> URL: https://issues.apache.org/jira/browse/LUCENE-3604
>     Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Doron Cohen
>Priority: Minor
>




[jira] [Commented] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0

2011-11-28 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158306#comment-13158306
 ] 

Doron Cohen commented on LUCENE-3604:
-

Fixed the 3x file in r1207018 - ordering the "API Changes" entries by their 
date (by svn log).
Keeping open for fixing the Changes.html that already appears in the Web site.

> 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0
> -
>
> Key: LUCENE-3604
> URL: https://issues.apache.org/jira/browse/LUCENE-3604
> Project: Lucene - Java
>  Issue Type: Bug
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
>




[jira] [Created] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0

2011-11-28 Thread Doron Cohen (Created) (JIRA)
3x/lucene/contrib/CHANGES.txt has two "API Changes" subsections for 3.5.0
-

 Key: LUCENE-3604
 URL: https://issues.apache.org/jira/browse/LUCENE-3604
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor


There are two "API Changes" sections which is confusing when looking at the txt 
version of the file. 
The HTML expands only the first of the two, unless expand-all is clicked.




3x/contrib/CHANGES.txt

2011-11-28 Thread Doron Cohen
Adding an entry to 3x/contrib/CHANGES.txt in
https://issues.apache.org/jira/browse/LUCENE-3596 I noticed that it had (No
changes) at the top (in the section already prepared for 3.6) although
there was already one bug fix listed below, so removed that assuming it is
a leftover - just wanted to verify that this was not left there with a
purpose I am not aware of?

Also, in the 3.5 section, there are two "API Changes" sub-sections.
I checked, and it is the same also in the
http://lucene.apache.org/java/3_5_0/changes/Contrib-Changes.html
Mmm... in fact this is a serious problem, because of the way the
expand/collapse works in Changes.html: the only way to actually see these
changes is to click "Expand All".

So fixing this in 3x is easy, I am not sure about the correct way to fix
this in the Web site. I'll dig.


Re: svn commit: r1206916 - in /lucene/dev/branches/branch_3x: ./ lucene/ lucene/backwards/src/ lucene/backwards/src/test-framework/ lucene/backwards/src/test/ solr/ solr/core/src/java/org/apache/solr/

2011-11-28 Thread Doron Cohen
I was trying to figure out the cause for these failures. It is somewhat
buried in the log file because of all the deprecation warnings, and because
javac does not print "ERROR" for the actual failure cause (though it does
print "WARNING" for the warnings...). By the time I found it you had already
committed the fix.

Doron

On Mon, Nov 28, 2011 at 9:51 AM, Uwe Schindler  wrote:

> Hi Erick,
>
> I had to fix the Java 5 errors in the 3x branch commit:
> In Java 5 interfaces do not support @Override (which is in my opinion
> correct and is horrible that it was introduced in Java 6: @Override on
> interfaces is wrong, as nothing is overridden), but for stupidity JDK6's
> compiler has a well-known bug and does not detect this syntax violation
> with -source 1.5).
>
> I recommend having a JDK5 installed to do the final test run before
> committing.
>
> Uwe
>


[jira] [Resolved] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

2011-11-28 Thread Doron Cohen (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-3596.
-

Resolution: Fixed

Thanks for reviewing Shai.

Committed:
- r1206996 - trunk
- r1207008 - 3x

CHANGES.txt entry is only in 3x because in trunk facet is under modules. I 
don't like this difference, but there it is.

> DirectoryTaxonomyWriter extensions should be able to set internal index 
> writer config attributes such as info stream
> 
>
> Key: LUCENE-3596
> URL: https://issues.apache.org/jira/browse/LUCENE-3596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-3596.patch, LUCENE-3596.patch
>
>
> Current protected openIndexWriter(Directory directory, OpenMode openMode) 
> does not provide access to the IWC it creates.
> So extensions must reimplement this method completely in order to set, e.g., the 
> info stream for the internal index writer.
> This came up in [user question: Taxonomy indexer debug 
> |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]




[jira] [Updated] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

2011-11-27 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3596:


Attachment: LUCENE-3596.patch

Patch taking approach (1) above, and moving createIWC() into the constructor. 

In addition, fixed some javadoc comments and added an assert to the 
constructor which, when assertions are enabled, verifies that the merge policy 
of the IWC in effect is not an instance of TieredMergePolicy. Imperfect as this 
is, it at least exposed the problem in the current test (fixed to use newLogMP()).

I think this is ready to commit.
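The assertion pattern described above can be sketched, Lucene-free and with hypothetical stand-in class names, roughly like this; the check runs only when assertions are enabled (java -ea), so production code pays no cost:

```java
// Hypothetical stand-ins for the real Lucene merge policy classes.
class MergePolicy {}
class TieredMergePolicy extends MergePolicy {}  // may merge segments out of order
class LogMergePolicy extends MergePolicy {}     // merges segments in order

class TaxonomyWriterSketch {
  private final MergePolicy mp;

  TaxonomyWriterSketch(MergePolicy mp) {
    // Only evaluated under -ea: an out-of-order merge policy would silently
    // corrupt the taxonomy, so fail fast at construction time in test runs.
    assert !(mp instanceof TieredMergePolicy)
        : "taxonomy index requires an in-order merge policy";
    this.mp = mp;
  }

  String policyName() {
    return mp.getClass().getSimpleName();
  }
}
```

The trade-off is exactly the "imperfect" one noted above: without -ea the misconfiguration passes undetected, but test suites (which run with assertions on) catch it immediately.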

> DirectoryTaxonomyWriter extensions should be able to set internal index 
> writer config attributes such as info stream
> 
>
> Key: LUCENE-3596
> URL: https://issues.apache.org/jira/browse/LUCENE-3596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/facet
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-3596.patch, LUCENE-3596.patch
>
>
> Current protected openIndexWriter(Directory directory, OpenMode openMode) 
> does not provide access to the IWC it creates.
> So extensions must reimplement this method completely in order to set e.f. 
> info stream for the internal index writer.
> This came up in [user question: Taxonomy indexer debug 
> |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]




[jira] [Commented] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

2011-11-27 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157717#comment-13157717
 ] 

Doron Cohen commented on LUCENE-3596:
-

Also, there seems to be a bug in the current taxonomy writer test - TestIndexClose 
- where the IndexWriterConfig's merge policy might allow merging segments 
out of order. That test calls LTC.newIndexWriterConfig() and it is just by luck 
that it has not failed so far.

This is a bad type of failure for an application (is there ever a good 
type? ;)), because by the time the bug is exposed it shows up as a wrong facet 
returned in faceted search, and it is hard to trace that back to an index 
writer, created much earlier, that allowed out-of-order merging...

Therefore, it would be useful if, in addition to the javadocs about the 
required type of merge policy, we also threw an exception (IllegalArgument or 
IO) when the IWC's merge policy allows merging out of order. 
This should be checked in two locations: 
- when createIWC() returns
- when openIndex() returns, by examining the IWC of the index

The second check is more involved, as it runs after the index has already 
been opened, so the index must be closed before throwing that exception.

However, MergePolicy does not have anything like 
Collector.acceptsDocsOutOfOrder() in its "contract", so it is not possible to 
verify this at all.

Adding such a method to MergePolicy seems like overkill for this 
particular case, unless there is additional interest in such a declaration.

Otherwise, it is possible to require that the merge policy be a descendant 
of LogMergePolicy. That, on the other hand, would not allow testing this class 
with other order-preserving policies, such as NoMergePolicy.

So I am not sure what is the best way to proceed in this regard.

I think there are actually two options:
# just javadoc that fact, and fix the test to always create an order-preserving 
MP
# add that declaration to MP

Unless there are opinions favoring the second option, I'll go with the first.

In addition (this is true for both options), I will move the call to createIWC 
into the constructor and modify the openIndex signature to accept an IWC 
instead of the open mode. It seems wrong, API-wise, that one extension point 
(createIWC) is invoked by another extension point (openIndex); better to have 
both invoked from the constructor, making it harder for someone to, by 
mistake, totally ignore in openIndex() the value returned by createIWC().
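The restructuring proposed above is essentially a template-method arrangement. A minimal, Lucene-free sketch (all names here are hypothetical, not the actual Lucene API): the constructor invokes both extension points in a fixed order and hands the config built by one hook to the other, so a subclass overriding the open step can no longer accidentally bypass the config it customized.

```java
// Hypothetical stand-in for IndexWriterConfig.
class Config {
  String infoStream = null;
}

class BaseWriter {
  protected final Config config;
  protected final String index;

  BaseWriter() {
    // Both hooks are called here, in a fixed order, rather than the open
    // step internally (re)building its own config.
    this.config = createConfig();
    this.index = openIndex(this.config);
  }

  // Extension point 1: subclasses customize the config (e.g. an info stream).
  protected Config createConfig() {
    return new Config();
  }

  // Extension point 2: receives the already-built config instead of making one.
  protected String openIndex(Config cfg) {
    return "index(infoStream=" + cfg.infoStream + ")";
  }
}

class DebugWriter extends BaseWriter {
  @Override
  protected Config createConfig() {
    Config cfg = super.createConfig();
    cfg.infoStream = "stderr";  // analogous to setting an info stream on the IWC
    return cfg;
  }
}
```

With the old shape (openIndex calling createConfig itself), a subclass reimplementing openIndex could silently drop the customized config; with this shape the base constructor guarantees the wiring.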

> DirectoryTaxonomyWriter extensions should be able to set internal index 
> writer config attributes such as info stream
> 
>
> Key: LUCENE-3596
> URL: https://issues.apache.org/jira/browse/LUCENE-3596
> Project: Lucene - Java
>  Issue Type: Improvement
>      Components: modules/facet
>        Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-3596.patch
>
>
> Current protected openIndexWriter(Directory directory, OpenMode openMode) 
> does not provide access to the IWC it creates.
> So extensions must reimplement this method completely in order to set e.f. 
> info stream for the internal index writer.
> This came up in [user question: Taxonomy indexer debug 
> |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]




[jira] [Assigned] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

2011-11-26 Thread Doron Cohen (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reassigned LUCENE-3596:
---

Assignee: Doron Cohen

> DirectoryTaxonomyWriter extensions should be able to set internal index 
> writer config attributes such as info stream
> 
>
> Key: LUCENE-3596
> URL: https://issues.apache.org/jira/browse/LUCENE-3596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/facet
>    Reporter: Doron Cohen
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-3596.patch
>
>
> Current protected openIndexWriter(Directory directory, OpenMode openMode) 
> does not provide access to the IWC it creates.
> So extensions must reimplement this method completely in order to set e.f. 
> info stream for the internal index writer.
> This came up in [user question: Taxonomy indexer debug 
> |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]




[jira] [Updated] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

2011-11-26 Thread Doron Cohen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-3596:


Attachment: LUCENE-3596.patch

Patch adds the method createIndexWriterConfig(OpenMode openMode) and javadocs 
about in-order segment merging.

> DirectoryTaxonomyWriter extensions should be able to set internal index 
> writer config attributes such as info stream
> 
>
> Key: LUCENE-3596
> URL: https://issues.apache.org/jira/browse/LUCENE-3596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/facet
>    Reporter: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-3596.patch
>
>
> Current protected openIndexWriter(Directory directory, OpenMode openMode) 
> does not provide access to the IWC it creates.
> So extensions must reimplement this method completely in order to set e.f. 
> info stream for the internal index writer.
> This came up in [user question: Taxonomy indexer debug 
> |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]




[jira] [Commented] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

2011-11-26 Thread Doron Cohen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157605#comment-13157605
 ] 

Doron Cohen commented on LUCENE-3596:
-

bq. and getIWC (if you intend to add it).
Yes, that's what I would like to add.
These docs are missing anyhow, with or without getIWC(). 
This added extensibility is useful, although behavior regarding the info 
stream differs between trunk and 3x - i.e. in 3x one can also set that stream 
through the current extension point.

> DirectoryTaxonomyWriter extensions should be able to set internal index 
> writer config attributes such as info stream
> 
>
> Key: LUCENE-3596
> URL: https://issues.apache.org/jira/browse/LUCENE-3596
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Doron Cohen
>Priority: Minor
>
> Current protected openIndexWriter(Directory directory, OpenMode openMode) 
> does not provide access to the IWC it creates.
> So extensions must reimplement this method completely in order to set e.f. 
> info stream for the internal index writer.
> This came up in [user question: Taxonomy indexer debug 
> |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]




Re: IndexWriter.infoStream is final?

2011-11-26 Thread Doron Cohen
>
> I changed this because it's totally insane in 3.x.
>
> Because it's 'live settable' there is crazy half-working (I don't think
> it really works) synchronization in various IndexWriter classes for
> this. Each one had a setter and each one had its own message() and
> other helper routines, and when you did set() on IndexWriter it called
> set() on all the other classes in a broken way.
>

I like the new behavior better, but missed the fact that setting the
infoStream is still available in 3x.

