[jira] Resolved: (LUCENE-1371) Add Searcher.search(Query, int)
[ https://issues.apache.org/jira/browse/LUCENE-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1371. Resolution: Fixed > Add Searcher.search(Query, int) > --- > > Key: LUCENE-1371 > URL: https://issues.apache.org/jira/browse/LUCENE-1371 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > > Now that we've deprecated Hits (LUCENE-1290), I think we should add this > trivial convenience method to Searcher, which is just sugar for > Searcher.search(Query, null, int) ie null filter, returning a TopDocs. > This way there is a simple API for users to retrieve the top N results for a > Query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
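The described sugar is tiny; a minimal sketch of such an overload on Searcher, assuming it does nothing beyond delegating to the existing search(Query, Filter, int) with a null filter:
{code}
// Minimal sketch of the convenience overload described above:
// null filter, returning the top n hits as a TopDocs.
public TopDocs search(Query query, int n) throws IOException {
  return search(query, null, n);
}

// Typical call site:
// TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
{code}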
[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar
[ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627943#action_12627943 ] Michael McCandless commented on LUCENE-1126: Hmm -- I'm now seeing an failure with this patch, in TestThaiAnalyzer (in contrib/analyzers): {code} [junit] Testcase: testAnalyzer(org.apache.lucene.analysis.th.TestThaiAnalyzer): FAILED [junit] expected: but was: [junit] junit.framework.ComparisonFailure: expected: but was: [junit] at org.apache.lucene.analysis.th.TestThaiAnalyzer.assertAnalyzesTo(TestThaiAnalyzer.java:43) [junit] at org.apache.lucene.analysis.th.TestThaiAnalyzer.testAnalyzer(TestThaiAnalyzer.java:54) [junit] {code} Does anyone else see this? > Simplify StandardTokenizer JFlex grammar > > > Key: LUCENE-1126 > URL: https://issues.apache.org/jira/browse/LUCENE-1126 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.2 >Reporter: Steven Rowe >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1126.patch > > > Summary of thread entitled "Fullwidth alphanumeric characters, plus a > question on Korean ranges" begun by Daniel Noll on java-user, and carried > over to java-dev: > On 01/07/2008 at 5:06 PM, Daniel Noll wrote: > > I wish the tokeniser could just use Character.isLetter and > > Character.isDigit instead of having to know all the ranges itself, since > > the JRE already has all this information. Character.isLetter does > > return true for CJK characters though, so the ranges would still come in > > handy for determining what kind of letter they are. I don't support > > JFlex has a way to do this... > The DIGIT macro could be replaced by JFlex's predefined character class > [:digit:], which has the same semantics as java.lang.Character.isDigit(). > Although JFlex's predefined character class [:letter:] (same semantics as > java.lang.Character.isLetter()) includes CJK characters, there is a way to > handle this using JFlex's regex negation syntax {{!}}. From [the JFlex > documentation|http://jflex.de/manual.html]: > bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is > !(!{{a}}|{{b}}) > So to exclude CJ characters from the LETTER macro: > {code} > LETTER = ! ( ! [:letter:] | {CJ} ) > {code} > > Since [:letter:] includes all of the Korean ranges, there's no reason > (AFAICT) to treat them separately; unlike Chinese and Japanese characters, > which are individually tokenized, the Korean characters should participate in > the same token boundary rules as all of the other letters. > I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 > supports, and Unicode 5.0, the latest version, and there are lots of new and > modified letter and digit ranges. This stuff gets tweaked all the time, and > I don't think Lucene should be in the business of trying to track it, or take > a position on which Unicode version users' data should conform to. > Switching to using JFlex's [:letter:] and [:digit:] predefined character > classes ties (most of) these decisions to the user's choice of JVM version, > and this seems much more reasonable to me than the current status quo. > I will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1374) Merging of compressed string Fields may hit NPE
Merging of compressed string Fields may hit NPE --- Key: LUCENE-1374 URL: https://issues.apache.org/jira/browse/LUCENE-1374 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.4 This bug was introduced with LUCENE-1219 (only present on 2.4). The bug happens when merging compressed string fields, but only if bulk-merging code does not apply because the FieldInfos for the segment being merged are not congruent. This test shows the bug: {code} public void testMergeCompressedFields() throws IOException { File indexDir = new File(System.getProperty("tempDir"), "mergecompressedfields"); Directory dir = FSDirectory.getDirectory(indexDir); try { for(int i=0;i<5;i++) { // Must make a new writer & doc each time, w/ // different fields, so bulk merge of stored fields // cannot run: IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, IndexWriter.MaxFieldLength.UNLIMITED); w.setMergeFactor(5); w.setMergeScheduler(new SerialMergeScheduler()); Document doc = new Document(); doc.add(new Field("test1", "this is some data that will be compressed this this this", Field.Store.COMPRESS, Field.Index.NO)); doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); doc.add(new Field("field" + i, "random field", Field.Store.NO, Field.Index.TOKENIZED)); w.addDocument(doc); w.close(); } byte[] cmp = new byte[20]; IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } } finally { dir.close(); _TestUtil.rmDir(indexDir); } } {code} It's because in FieldsReader, when we load a field "for merge" we create a FieldForMerge instance which subsequently does not return the right values for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1374: --- Attachment: LUCENE-1374.patch Attached patch that fixes AbstractField's getBinaryValue() and getBinaryLength() methods to fallback to "fieldsData instanceof byte[]" when appropriate. I plan to commit shortly. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
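A rough sketch of the fallback described above, assuming the stored value sits in AbstractField's generic fieldsData slot (an illustration of the idea, not necessarily the committed patch):
{code}
// Sketch: when no dedicated binary buffer was set, fall back to fieldsData
// if it happens to hold the byte[] directly (the FieldForMerge case).
public byte[] getBinaryValue() {
  if (fieldsData instanceof byte[])
    return (byte[]) fieldsData;
  return null;
}

public int getBinaryLength() {
  if (fieldsData instanceof byte[])
    return ((byte[]) fieldsData).length;
  return -1;  // assumption: no binary value available in this sketch
}
{code}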
Re: Moving SweetSpotSimilarity out of contrib
On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving SweetSpotSimilarity out of contrib": > > : >From a legal standpoint, whenever we need to use open-source code, somebody > : has to inspect the code and 'approve' it. This inspection makes sure there's > : no use of 3rd party libraries, to which we'd need to get open-source > : clearance as well. > > You should talk to whomever you need to talk to at your company about > revising the approach you are taking -- the core vs contrib distinction in > Lucene-Java is one of our own making that is completely artificial. With > Lucene 2.4 we could decide to split what is currently known as the "core" > into 27 different directories, none of which are called core, and all of > which have an interdependency on each other. We're not likely to, but we > could -- and then where would your company be? I can't really defend the lawyers (sometimes you get the feeling that they are out to slow you down, rather than help you :( ), but let me try to explain where this sort of thinking comes from, because I think it is actually quite common. Lucene makes the claim that it has the "Apache license", so that any company can (to make a long story short) use this code. But when a company sets out to use Lucene, can it take this claim at face value? After all, what happens if somebody steals some proprietary code and puts it up on the web claiming it has the Apache license - does it give the users of that stolen code any rights? Of course not, because the rights weren't the distributor's to give out in the first place. So it is quite natural that when a company wants to use some open-source code it doesn't take the license at face value, and rather does some "due diligence" to verify that the people who published this code really owned the rights to it. E.g., the company lawyers might want to do some background checks on the committers, look at the project's history (e.g., that it doesn't have some "out of the blue" donations from vague sources), check the code and comments for suspicious strings, patterns, and so on. When you need to inspect the code, naturally you need to decide what you inspect. This particular company chose to inspect only the Lucene core, perhaps because it is smaller, has fewer contributors, and has the vast majority of what most Lucene users need. Inspecting all of contrib - with all its foreign-language analyzers, stuff like gdata and other rarely used stuff - may be too hard for them. But then, the question I would ask is - why not inspect the core *and* the few contribs that interest you? For example, SweetSpotSimilarity (which you need) and other generally useful stuff like Highlighter and SnowballAnalyzer. > Doing this would actually be a complete reversal of the goals discussed in > the near past: increasing our use of the contrib structure for new > features that aren't inherently tied to the "guts" of Lucene. The goal > being to keep the "core" jar as small as possible for people who want to > develop apps with a small footprint. I agree that this is an important goal. > At one point there was even talk of refactoring additional code out of the > core and into a contrib (this was already done with some analyzers when > Lucene became a TLP) -- Nadav Har'El | Wednesday, Sep 3 2008, 3 Elul 5768 | IBM Haifa Research Lab | http://nadav.harel.org.il | Promises are like babies: fun to make, but hell to deliver. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
Thanks all for the "legal" comments. Can we consider moving the SweetSpotSimilarity to "core" because of the quality improvements it introduces to search? I tried to emphasize that that's the main reason, but perhaps I didn't do a good job at that, since the discussion has turned into a legal issue :-). On Wed, Sep 3, 2008 at 3:21 PM, Nadav Har'El <[EMAIL PROTECTED]>wrote: > On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving > SweetSpotSimilarity out of contrib": > > > > : >From a legal standpoint, whenever we need to use open-source code, > somebody > > : has to inspect the code and 'approve' it. This inspection makes sure > there's > > : no use of 3rd party libraries, to which we'd need to get open-source > > : clearance as well. > > > > You should talk to whomever you need to talk to at your company about > > revising the appraoch you are taking -- the core vs contrib distinction > in > > Lucene-Java is one of our own making that is completly artificial. With > > Lucene 2.4 we could decide to split what is currently known as the "core" > > into 27 different directories, none of which are called core, and all of > > which have an interdependency on eachother. We're not likely to, but we > > could -- and then where woud your company be? > > I can't really defend the lawyers (sometimes you get the feeling that they > are out to slow you down, rather than help you :( ), but let me try to > explain > where this sort of thinking comes from, because I think it is actually > quite > common. > > Lucene makes the claim that it has the "apache license", so that any > company > can (to make a long story short) use this code. But when a company sets out > to use Lucene, can it take this claim at face value? After all, what > happens > if somebody steals some proprietary code and puts it up on the web claiming > it > has the apache license - does it give the users of that stolen code any > rights? Of course not, because the rights weren't the distributor's to give > out in the first place. > > So it is quite natural that when a company wants to use use some > open-source > code it doesn't take the license at face value, and rather does some "due > diligance" to verify that the people who published this code really owned > the rights to it. E.g., the company lawyers might want to do some > background > checks on the committers, look at the project's history (e.g., that it > doesn't > have some "out of the blue" donations from vague sources), check the code > and > comments for suspicious strings, patterns, and so on. > > When you need to inspect the code, naturally you need to decide what you > inspect. This particular company chose to inspect only the Lucene core, > perhaps because it is smaller, has fewer contributors, and has the vast > majority of what most Lucene users need. Inspecting all the contrib - with > all its foreign language analyzers, stuff like gdata and other rarely used > stuff - may be too hard for them. But then, the question I would ask is - > why not inspect the core *and* the few contribs that interest you? For > example, SweetSpotSimilarity (which you need) and other generally useful > stuff like Highlighter and SnowballAnalyzer. > > > Doing this would actually be a complete reversal of the goals discussed > in > > the near past: increasing our use of the contrib structure for new > > features that aren't inherently tied to the "guts" of Lucene. 
The goal > > being to keep the "core" jar as small as possible for people who want to > > develop apps with a small foot print. > > I agree that this is an important goal. > > > At one point there was even talk of refactoring additional code out of > the > > core and into a contrib (this was already done with some analyzers when > > Lucene became a TLP) > > -- > Nadav Har'El| Wednesday, Sep 3 2008, 3 Elul > 5768 > IBM Haifa Research Lab > |- >|Promises are like babies: fun to make, > http://nadav.harel.org.il |but hell to deliver. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Moving SweetSpotSimilarity out of contrib
I think it's a fair question that, regardless of the legal mumbo jumbo provoking it, can be considered on the merits, as it should be: is it something important enough to bulk up the core, with the trade-off being that more people will find it helpful and can use it with slightly less hassle? I have seen discussion about core vs contrib before, and from what I saw, the distinction and rules are not quite clear. I would think, though, that if the new Similarity is really that much better than the old, it might actually benefit from being in core. There is no doubt core gets more attention on both the user and developer side, and important pieces with general usage should probably be there. I haven't used it myself, so I won't guess (too much), but the question to me seems to be: is SweetSpot important enough to move to core? Are there enough good reasons? And even if so, is it ready to move to core? Contrib also seems to be somewhat of a possible incubation area... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627990#action_12627990 ] Grant Ingersoll commented on LUCENE-1373: - I think you should mirror what is done in StandardAnalyzer. You probably could create an abstract class that all of them inherit to share the common code. Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one. This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe. > Most of the contributed Analyzers suffer from invalid recognition of acronyms. > -- > > Key: LUCENE-1373 > URL: https://issues.apache.org/jira/browse/LUCENE-1373 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis, contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Mark Lassau >Priority: Minor > > LUCENE-1068 describes a bug in StandardTokenizer whereby a string like > "www.apache.org." would be incorrectly tokenized as an acronym (note the dot > at the end). > Unfortunately, keeping the "backward compatibility" of a bug turns out to > harm us. > StandardTokenizer has a couple of ways to indicate "fix this bug", but > unfortunately the default behaviour is still to be buggy. > Most of the non-English analyzers provided in lucene-analyzers utilize the > StandardTokenizer, and in v2.3.2 not one of these provides a way to get the > non-buggy behaviour :( > I refer to: > * BrazilianAnalyzer > * CzechAnalyzer > * DutchAnalyzer > * FrenchAnalyzer > * GermanAnalyzer > * GreekAnalyzer > * ThaiAnalyzer -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
Not tried SweetSpot so can't comment on worthiness of moving to core but agree with the principle that we can't let the hassles of a company's "due diligence" testing dictate the shape of core vs contrib. For anyone concerned with the overhead of doing these checks a company/product of potential interest is "Black Duck". I don't work for them and don't offer any endorsement but simply point them out as something you might want to take a look at. Cheers Mark - Original Message From: Nadav Har'El <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Wednesday, 3 September, 2008 13:21:34 Subject: Re: Moving SweetSpotSimilarity out of contrib On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving SweetSpotSimilarity out of contrib": > > : >From a legal standpoint, whenever we need to use open-source code, somebody > : has to inspect the code and 'approve' it. This inspection makes sure there's > : no use of 3rd party libraries, to which we'd need to get open-source > : clearance as well. > > You should talk to whomever you need to talk to at your company about > revising the appraoch you are taking -- the core vs contrib distinction in > Lucene-Java is one of our own making that is completly artificial. With > Lucene 2.4 we could decide to split what is currently known as the "core" > into 27 different directories, none of which are called core, and all of > which have an interdependency on eachother. We're not likely to, but we > could -- and then where woud your company be? I can't really defend the lawyers (sometimes you get the feeling that they are out to slow you down, rather than help you :( ), but let me try to explain where this sort of thinking comes from, because I think it is actually quite common. Lucene makes the claim that it has the "apache license", so that any company can (to make a long story short) use this code. But when a company sets out to use Lucene, can it take this claim at face value? After all, what happens if somebody steals some proprietary code and puts it up on the web claiming it has the apache license - does it give the users of that stolen code any rights? Of course not, because the rights weren't the distributor's to give out in the first place. So it is quite natural that when a company wants to use use some open-source code it doesn't take the license at face value, and rather does some "due diligance" to verify that the people who published this code really owned the rights to it. E.g., the company lawyers might want to do some background checks on the committers, look at the project's history (e.g., that it doesn't have some "out of the blue" donations from vague sources), check the code and comments for suspicious strings, patterns, and so on. When you need to inspect the code, naturally you need to decide what you inspect. This particular company chose to inspect only the Lucene core, perhaps because it is smaller, has fewer contributors, and has the vast majority of what most Lucene users need. Inspecting all the contrib - with all its foreign language analyzers, stuff like gdata and other rarely used stuff - may be too hard for them. But then, the question I would ask is - why not inspect the core *and* the few contribs that interest you? For example, SweetSpotSimilarity (which you need) and other generally useful stuff like Highlighter and SnowballAnalyzer. 
> Doing this would actually be a complete reversal of the goals discussed in > the near past: increasing our use of the contrib structure for new > features that aren't inherently tied to the "guts" of Lucene. The goal > being to keep the "core" jar as small as possible for people who want to > develop apps with a small foot print. I agree that this is an important goal. > At one point there was even talk of refactoring additional code out of the > core and into a contrib (this was already done with some analyzers when > Lucene became a TLP) -- Nadav Har'El| Wednesday, Sep 3 2008, 3 Elul 5768 IBM Haifa Research Lab |- |Promises are like babies: fun to make, http://nadav.harel.org.il |but hell to deliver. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1374. Resolution: Fixed Committed revision 691617. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Multi Phrase Search at the Beginning of a field
Excellent, it worked :) Thank you Tori!! Regards, Vinay >-Original Message- >From: ext Andraz Tori [mailto:[EMAIL PROTECTED] >Sent: 01 September, 2008 16:39 >To: java-dev@lucene.apache.org >Subject: Re: Multi Phrase Search at the Beginning of a field > >You can use standard trick. > >Insert a special token at the beginning of every field you are >indexing, and add that special token to beginning of every query. > >Since this token will not occur anywhere else in the field, >you will know that your queries match only beginnings of fields > >bye >andraz > >On Mon, 2008-09-01 at 15:50 +0300, [EMAIL PROTECTED] wrote: >> Hi, >> >> Can some one please help me in providing a solution for my problem: >> >> I have a single field defined in my document. Now I want to do a >> MultiPhraseQuery - but at the beginning of the field. >> >> For e.g: If there are 3 documents with single field ( say 'title' ) >> has the values -> "Hello Love you", "Love You Sister", "Love Yoxyz" >> >> Then my search for "Love yo*" -> MultiPhraseQuery with first term >> "Love" ( using addTerm("Love") and the next terms ( using >> addTerms("Yo*" - after getting all terms 'You' and 'Yoxyz' using >> IndexReader.terms(Yo) ) should return only the documents "Love You >> Sister", "Love Yoxyz" - but not "Hello Love you". >> >> Can some one please help me on how to get it done. >> >> >> Regards, >> Vinay >> >-- >Andraz Tori, CTO >Zemanta Ltd, London, Ljubljana >www.zemanta.com >mail: [EMAIL PROTECTED] >tel: +386 41 515 767 >twitter: andraz, skype: minmax_test > > > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > >
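A sketch of the sentinel-token trick in Lucene 2.x terms (field name, sentinel string and the expanded terms are illustrative, and the example assumes a lowercasing analyzer):
{code}
// Index time: prepend a token that can never occur in real titles.
Document doc = new Document();
doc.add(new Field("title", "_start_ Love You Sister",
                  Field.Store.YES, Field.Index.TOKENIZED));

// Query time: anchor the MultiPhraseQuery with the same sentinel, then the
// literal term, then the terms enumerated from the "yo" prefix.
MultiPhraseQuery q = new MultiPhraseQuery();
q.add(new Term("title", "_start_"));
q.add(new Term("title", "love"));
q.add(new Term[] {
    new Term("title", "you"),
    new Term("title", "yoxyz")
});
{code}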
[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system
[ https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628025#action_12628025 ] Ning Li commented on LUCENE-532: Is the use of seek and write in ChecksumIndexOutput making Lucene less likely to support all sequential write (i.e. no seek write)? ChecksumIndexOutput is currently used by SegmentInfos. > [PATCH] Indexing on Hadoop distributed file system > -- > > Key: LUCENE-532 > URL: https://issues.apache.org/jira/browse/LUCENE-532 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 1.9 >Reporter: Igor Bolotin >Priority: Minor > Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, > TermInfosWriter.patch > > > In my current project we needed a way to create very large Lucene indexes on > Hadoop distributed file system. When we tried to do it directly on DFS using > Nutch FsDirectory class - we immediately found that indexing fails because > DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason > for this behavior is clear - DFS does not support random updates and so > seek() method can't be supported (at least not easily). > > Well, if we can't support random updates - the question is: do we really need > them? Search in the Lucene code revealed 2 places which call > IndexOutput.seek() method: one is in TermInfosWriter and another one in > CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the > only place that concerned us was in TermInfosWriter. > > TermInfosWriter uses IndexOutput.seek() in its close() method to write total > number of terms in the file back into the beginning of the file. It was very > simple to change file format a little bit and write number of terms into last > 8 bytes of the file instead of writing them into beginning of file. The only > other place that should be fixed in order for this to work is in > SegmentTermEnum constructor - to read this piece of information at position = > file length - 8. > > With this format hack - we were able to use FsDirectory to write index > directly to DFS without any problems. Well - we still don't index directly to > DFS for performance reasons, but at least we can build small local indexes > and merge them into the main index on DFS without copying big main index back > and forth. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
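A sketch of the append-only pattern the issue describes, expressed with IndexOutput/IndexInput (illustrative helper methods, not the actual TermInfosWriter/SegmentTermEnum patch):
{code}
// Writer side: instead of seeking back to position 0, append the term
// count as the last 8 bytes of the file.
void finish(IndexOutput out, long termCount) throws IOException {
  out.writeLong(termCount);
  out.close();
}

// Reader side: read the count from (file length - 8), then rewind.
long readTermCount(IndexInput in) throws IOException {
  in.seek(in.length() - 8);
  long count = in.readLong();
  in.seek(0);
  return count;
}
{code}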
Can I filter the results returned by IndexReader.terms(term)?
I am using IndexReader.terms(term) to produce term suggestions to my users as they type. In many cases the user is searching lucene with a filter applied, for example a date range. Is there any way I can get a list of terms in the index that are contained within a subset of the documents by a given filter. i.e. I'd like to do something like ... IndexReader reader = readerProvider.openReader(directoryProvider); reader.filterDocument(filter); TermEnum termEnum = reader.terms(new Term("name", "")); ...Iterate on terms I've scouted all over the API and I cannot find how to do this or if it is possible. Please let me know if it can be done and if so how. Thanks! -- View this message in context: http://www.nabble.com/Can-I-filter-the-results-returned-by-IndexReader.terms%28term%29--tp19292207p19292207.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Can I filter the results returned by IndexReader.terms(term)?
One way is to read TermDocs for each candidate term and see if they are in your filter - but that sounds like a lot of disk IO to me when responding to individual user keystrokes. You can use "skip" to avoid reading all term docs when you know what is in the filter but it all seems a bit costly. It's hard to optimise in advance for this, especially if the filter is an arbitrary choice of documents for each user. - Original Message From: AdrianPillinger <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Wednesday, 3 September, 2008 16:54:11 Subject: Can I filter the results returned by IndexReader.terms(term)? I am using IndexReader.terms(term) to produce term suggestions to my users as they type. In many cases the user is searching lucene with a filter applied, for example a date range. Is there any way I can get a list of terms in the index that are contained within a subset of the documents by a given filter. i.e. I'd like to do something like ... IndexReader reader = readerProvider.openReader(directoryProvider); reader.filterDocument(filter); TermEnum termEnum = reader.terms(new Term("name", "")); ...Iterate on terms I've scouted all over the API and I cannot find how to do this or if it is possible. Please let me know if it can be done and if so how. Thanks! -- View this message in context: http://www.nabble.com/Can-I-filter-the-results-returned-by-IndexReader.terms%28term%29--tp19292207p19292207.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
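A rough sketch of that approach against the 2.x API, assuming the filter has already been materialized as a BitSet of accepted doc ids (reader, prefix and filterBits are placeholders):
{code}
// Keep a prefix term only if at least one of its documents passes the filter.
List suggestions = new ArrayList();
TermEnum terms = reader.terms(new Term("name", prefix));
TermDocs termDocs = reader.termDocs();
try {
  do {
    Term t = terms.term();
    if (t == null || !t.field().equals("name") || !t.text().startsWith(prefix))
      break;
    termDocs.seek(t);
    while (termDocs.next()) {
      if (filterBits.get(termDocs.doc())) {   // filterBits: java.util.BitSet of accepted docs
        suggestions.add(t.text());
        break;                                // one matching doc is enough
      }
    }
  } while (terms.next());
} finally {
  termDocs.close();
  terms.close();
}
{code}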
[jira] Commented: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628055#action_12628055 ] Chris Harris commented on LUCENE-1374: -- "ant test" on 691617 for me fails on the following test: java.io.IOException: could not delete C:\lucene\691647\build\test\mergecompressedfields\_5.cfs at org.apache.lucene.util._TestUtil.rmDir(_TestUtil.java:37) at org.apache.lucene.index.TestIndexWriter.testMergeCompressedFields(TestIndexWriter.java:4111) It might be one of those things that shows up only on Windows. In any case, adding a call to IndexReader.close() in testMergeCompressedFields() seems to fix things up: IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } r.close(); // <--- New line } finally { dir.close(); _TestUtil.rmDir(indexDir); } I guess technically the r.close() probably belongs in a finally block as well. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Issue Comment Edited: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628055#action_12628055 ] ryguasu edited comment on LUCENE-1374 at 9/3/08 10:07 AM: --- "ant test" on 691617 for me fails on the following test: java.io.IOException: could not delete C:\lucene\691647\build\test\mergecompressedfields\_5.cfs at org.apache.lucene.util._TestUtil.rmDir(_TestUtil.java:37) at org.apache.lucene.index.TestIndexWriter.testMergeCompressedFields(TestIndexWriter.java:4111) It might be one of those things that shows up only on Windows. In any case, adding a call to IndexReader.close() in testMergeCompressedFields() seems to fix things up: {code} IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } r.close(); // <--- New line } finally { dir.close(); _TestUtil.rmDir(indexDir); } {code} I guess technically the r.close() probably belongs in a finally block as well. was (Author: ryguasu): "ant test" on 691617 for me fails on the following test: java.io.IOException: could not delete C:\lucene\691647\build\test\mergecompressedfields\_5.cfs at org.apache.lucene.util._TestUtil.rmDir(_TestUtil.java:37) at org.apache.lucene.index.TestIndexWriter.testMergeCompressedFields(TestIndexWriter.java:4111) It might be one of those things that shows up only on Windows. In any case, adding a call to IndexReader.close() in testMergeCompressedFields() seems to fix things up: IndexReader r = IndexReader.open(dir); for(int i=0;i<5;i++) { Document doc = r.document(i); assertEquals("this is some data that will be compressed this this this", doc.getField("test1").stringValue()); byte[] b = doc.getField("test2").binaryValue(); assertTrue(Arrays.equals(b, cmp)); } r.close(); // <--- New line } finally { dir.close(); _TestUtil.rmDir(indexDir); } I guess technically the r.close() probably belongs in a finally block as well. > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. 
This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically gen
Re: Can I filter the results returned by IndexReader.terms(term)?
Another way is to use the trunk, where Scorer is a subclass of DocIdSetIterator, which is returned by a Filter. This allows to create a TermFilter that returns a TermScorer (which is based on TermEnum internally.) Try wrapping it in a CachingWrapperFilter when it needs to be reused. Finally, have a look here to see whether it could help in your case: https://issues.apache.org/jira/browse/LUCENE-1296 Regards, Paul Elschot Op Wednesday 03 September 2008 18:00:27 schreef mark harwood: > One way is to read TermDocs for each candidate term and see if they > are in your filter - but that sounds like a lot of disk IO to me when > responding to individual user keystrokes. You can use "skip" to avoid > reading all term docs when you know what is in the filter but it all > seems a bit costly. > > It's hard to optimise in advance for this, especially if the filter > is an arbitrary choice of documents for each user. > > > > - Original Message > From: AdrianPillinger <[EMAIL PROTECTED]> > To: java-dev@lucene.apache.org > Sent: Wednesday, 3 September, 2008 16:54:11 > Subject: Can I filter the results returned by > IndexReader.terms(term)? > > > I am using IndexReader.terms(term) to produce term suggestions to my > users as they type. In many cases the user is searching lucene with a > filter applied, for example a date range. > > Is there any way I can get a list of terms in the index that are > contained within a subset of the documents by a given filter. > > i.e. I'd like to do something like > > ... > IndexReader reader = readerProvider.openReader(directoryProvider); > reader.filterDocument(filter); > TermEnum termEnum = reader.terms(new Term("name", " term>")); ...Iterate on terms > > > > I've scouted all over the API and I cannot find how to do this or if > it is possible. > > Please let me know if it can be done and if so how. > > Thanks! - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
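A small sketch of the trunk-style plumbing described above, using QueryWrapperFilter as a stand-in for a dedicated TermFilter (names are illustrative):
{code}
// Trunk (2.4): a Filter hands back a DocIdSet whose iterator can be walked
// much like a Scorer, and the caching wrapper makes reuse across keystrokes cheap.
Filter filter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("name", "foo"))));
DocIdSetIterator it = filter.getDocIdSet(reader).iterator();
while (it.next()) {
  int docId = it.doc();   // a document accepted by the filter
}
{code}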
[jira] Commented: (LUCENE-1374) Merging of compressed string Fields may hit NPE
[ https://issues.apache.org/jira/browse/LUCENE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628067#action_12628067 ] Michael McCandless commented on LUCENE-1374: Woops, you're right: I too see that failure (to rmDir the directory) only on Windows. I'll commit a fix. Thanks Chris! > Merging of compressed string Fields may hit NPE > --- > > Key: LUCENE-1374 > URL: https://issues.apache.org/jira/browse/LUCENE-1374 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.4 > > Attachments: LUCENE-1374.patch > > > This bug was introduced with LUCENE-1219 (only present on 2.4). > The bug happens when merging compressed string fields, but only if > bulk-merging code does not apply because the FieldInfos for the segment being > merged are not congruent. This test shows the bug: > {code} > public void testMergeCompressedFields() throws IOException { > File indexDir = new File(System.getProperty("tempDir"), > "mergecompressedfields"); > Directory dir = FSDirectory.getDirectory(indexDir); > try { > for(int i=0;i<5;i++) { > // Must make a new writer & doc each time, w/ > // different fields, so bulk merge of stored fields > // cannot run: > IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), i==0, > IndexWriter.MaxFieldLength.UNLIMITED); > w.setMergeFactor(5); > w.setMergeScheduler(new SerialMergeScheduler()); > Document doc = new Document(); > doc.add(new Field("test1", "this is some data that will be compressed > this this this", Field.Store.COMPRESS, Field.Index.NO)); > doc.add(new Field("test2", new byte[20], Field.Store.COMPRESS)); > doc.add(new Field("field" + i, "random field", Field.Store.NO, > Field.Index.TOKENIZED)); > w.addDocument(doc); > w.close(); > } > byte[] cmp = new byte[20]; > IndexReader r = IndexReader.open(dir); > for(int i=0;i<5;i++) { > Document doc = r.document(i); > assertEquals("this is some data that will be compressed this this > this", doc.getField("test1").stringValue()); > byte[] b = doc.getField("test2").binaryValue(); > assertTrue(Arrays.equals(b, cmp)); > } > } finally { > dir.close(); > _TestUtil.rmDir(indexDir); > } > } > {code} > It's because in FieldsReader, when we load a field "for merge" we create a > FieldForMerge instance which subsequently does not return the right values > for getBinary{Value,Length,Offset}. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
: saw, the distinction and rules are not quite clear. I would think though, if : the new Similarity is really that much better than the old, it might actually : benefit in core. There is no doubt core gets more attention on both the user : and developer side, and important pieces with general usage should probably : be there. I see a Chicken/Egg argument here ... Perhaps contribs would get more attention if we used them more -- as in: put more stuff in them. : I haven't used it myself, so I won't guess (too much), but the question to : me seems to be, is SweetSpot important enough to move to core? Are there : enough good reasons? And even if so, is it ready to move to core? Contrib also : seems to be somewhat of a possible incubation area... I think that's the wrong question to ask. I would rather ask the question "Is X decoupled enough from Lucene internals that it can be a contrib?" Things like IndexWriter, IndexReader, Document and TokenStream really need to be "core" ... but things like the QueryParser, and most of our analyzers, don't. Having lots of loosely coupled mini-libraries that respect good API boundaries seems more reusable and generally saner than "all of this code is useful and lots of people want it so throw it into the kitchen sink". We don't need to go hog wild gutting things out of the core ... but I don't think we should be adding new things to the core just because they are "generally useful". -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
I would agree with you if I were wrong about the contrib/core attention thing, but I don't think I am. It seems as if you have been arguing that contrib is really just an extension of core, on par with core, but just in different libs, and to keep core lean and mean, anything not needed in core shouldn't be there - sounds like an idea I could get behind, but it seems to ignore the reality: the user/dev focus definitely seems to be on core. Some of contrib is a graveyard in terms of dev and use, I think. I think it's still entangled in its "sandbox" roots. Contrib lacks many requirements of core code - it can be Java 1.5, it doesn't have to be backward compatible, etc. Putting something in core ensures it's treated as a Lucene first-class citizen; stuff in contrib is not held to such strict standards. Even down to the people working on the code, there is a lower bar to become a contrib committer than a full core committer (see my contrib committer status). It's not that I don't like what you propose, but I don't buy it as very viable the way things are now. IMO we would need to do some work to make it a reality. It can be said that's the way it is, but my view of things doesn't jibe with it. I may have miswritten "generally useful". What I meant was, if the sweet spot sim is better than the default sim, but a bit harder to use because of config, perhaps it is "core" enough to go there, as often it may be better to use. Again, I fully believe it would get more attention and be 'better' maintained. I did not mean to set the bar at "generally useful" and I apologize for my imprecise language (one of my many faults). I think that's the wrong question to ask. I would rather ask the question "Is X decoupled enough from Lucene internals that it can be a contrib?" Things like IndexWriter, IndexReader, Document and TokenStream really need to be "core" ... but things like the QueryParser, and most of our analyzers, don't. Having lots of loosely coupled mini-libraries that respect good API boundaries seems more reusable and generally saner than "all of this code is useful and lots of people want it so throw it into the kitchen sink". We don't need to go hog wild gutting things out of the core ... but I don't think we should be adding new things to the core just because they are "generally useful". -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1313) Ocean Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628092#action_12628092 ] Jason Rutherglen commented on LUCENE-1313: -- Is there a good place to place the javadocs on the Apache website once they are more complete? > Ocean Realtime Search > - > > Key: LUCENE-1313 > URL: https://issues.apache.org/jira/browse/LUCENE-1313 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Jason Rutherglen > Attachments: lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, > lucene-1313.patch > > > Provides realtime search using Lucene. Conceptually, updates are divided > into discrete transactions. The transaction is recorded to a transaction log > which is similar to the mysql bin log. Deletes from the transaction are made > to the existing indexes. Document additions are made to an in memory > InstantiatedIndex. The transaction is then complete. After each transaction > TransactionSystem.getSearcher() may be called which allows searching over the > index including the latest transaction. > TransactionSystem is the main class. Methods similar to IndexWriter are > provided for updating. getSearcher returns a Searcher class. > - getSearcher() > - addDocument(Document document) > - addDocument(Document document, Analyzer analyzer) > - updateDocument(Term term, Document document) > - updateDocument(Term term, Document document, Analyzer analyzer) > - deleteDocument(Term term) > - deleteDocument(Query query) > - commitTransaction(List documents, Analyzer analyzer, List > deleteByTerms, List deleteByQueries) > Sample code: > {code} > // setup > FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), > "log"); > LogDirectory logDirectory = directoryMap.getLogDirectory(); > TransactionLog transactionLog = new TransactionLog(logDirectory); > TransactionSystem system = new TransactionSystem(transactionLog, new > SimpleAnalyzer(), directoryMap); > // transaction > Document d = new Document(); > d.add(new Field("contents", "hello world", Field.Store.YES, > Field.Index.TOKENIZED)); > system.addDocument(d); > // search > OceanSearcher searcher = system.getSearcher(); > ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs; > System.out.println(hits.length + " total results"); > for (int i = 0; i < hits.length && i < 10; i++) { > Document d = searcher.doc(hits[i].doc); > System.out.println(i + " " + hits[i].score+ " " + d.get("contents"); > } > {code} > There is a test class org.apache.lucene.ocean.TestSearch that was used for > basic testing. > A sample disk directory structure is as follows: > |/snapshot_105_00.xml | XML file containing which indexes and their > generation numbers correspond to a snapshot. Each transaction creates a new > snapshot file. In this file the 105 is the snapshotid, also known as the > transactionid. The 00 is the minor version of the snapshot corresponding to > a merge. A merge is a minor snapshot version because the data does not > change, only the underlying structure of the index| > |/3 | Directory containing an on disk Lucene index| > |/log | Directory containing log files| > |/log/log0001.bin | Log file. As new log files are created the suffix > number is incremented| -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Moving SweetSpotSimilarity out of contrib
On 09/03/2008 at 2:00 PM, Chris Hostetter wrote: > On 09/03/2008 at 8:40 AM, Mark Miller wrote: > > I havn't used it myself, so I won't guess (too much ), but the > > question to me seems to be, is SweetSpot important enough to move to > > core? Are there enough good reasons? And even if so, is it ready to > > move to core? Contrib also seems to be somewhat of a possible > > incubation area... > > I think that's the wrong question to ask. I would rather ask the > question "Is X decoupled enough from Lucene internals that it can be a > contrib?" Things like IndexWriter, IndexReader, Document and TokenStream > really need to be "core" ... but things like the QueryParser, and most > of our analyzers don't. Having lots of loosely coupled mini-libraries > that respect good API boundaries seems more reusable and generally saner > then "all of this code is useful and lots of people wnat it so throw it > into the kitchen sink" > > We don't need to go hog wild gutting things out of the core ... but i > don't think we should be adding new things to the core just > becuase they are "generally useful". One of core's requirements is: no external dependencies. Although many contrib components meet this requirement, there is no structural differentiation between them and those that don't. So from the point of view of simplifying lawyers' licensing labors :), it might make sense to split off a "contrib-no-ext-deps". Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
Another important driver is the "out-of-the-box experience". It's crucial that Lucene has good starting defaults for everything because many developers will stick with these defaults and won't discover the wiki page that says you need to do X, Y and Z to get better relevance, indexing speed, searching speed, etc. This then makes Lucene look bad, not only to these Lucene users but then also to the end users who use their apps that say "Powered by Lucene". It also affects Lucene's adoption/growth over time: when a potential new user is just "trying Lucene out" we want our defaults to shine because those new users will walk away if Lucene doesn't compare well to other engines that are well-tuned out-of-the-box. I remember a while back we discussed an article comparing performance of various search engines and we were disappointed that the author didn't do X, Y and Z to let Lucene compete fairly. If we had good defaults that wouldn't have happened (or, at least to a lesser extent). Obviously we can't default everything perfectly since at some point there are hard tradeoffs to be made and every app is different, but if SweetSpotSimilarity really gives better relevance for many/most apps, and doesn't have any downsides (I haven't looked closely myself), I think we should get it into core? You know... it's almost like we need a "standard distro" (drawing analogy to Linux) for Lucene, which would be the core plus cherry-pick certain important contrib modules (highlighter, SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them together. See, highlighting is obviously well "decoupled" from Lucene's core, so it should remain in contrib, yet is also cleary a very important function in nearly every search engine. Mike Mark Miller wrote: I would agree with you if I was wrong about the contrib/core attention thing, but I don't think I am. It seems as if you have been arguing that contrib is really just an extension of core, on par with core, but just in different libs, and to keep core lean and mean, anything not needed in core shouldn't be there - sounds like an idea I could get behind, but seems to ignore the reality: The user/dev focus definitely seems to be on core. Some of contrib is a graveyard in terms of dev and use I think. I think its still entangled in its "sandbox" roots. Contrib lacks many requirements of core code - it can be java 1.5, it doesn't have to be backward compatible, etc. Putting something in core ensures its treated as a Lucene first class citizen, stuff in contrib is not held to such strict standards. Even down to the people working on the code, there is a lower bar to become a contrib commiter than a full core committer (see my contrib committer status ). Its not that I don't like what you propose, but I don't buy it as very viable the way things are now. IMO we would need to do some work to make it a reality. It can be said thats the way it is, but my view of things doesnt jive with it. I may have mis written "generally useful". What I meant was, if the sweet spot sim is better than the default sim, but a bit harder to use because of config, perhaps it is "core" enough to go there, as often it may be better to use. Again, I fully believe it would get more attention and be 'better' maintained. I did not mean to set the bar at "generally useful" and I apologize for my imprecise language (one of my many faults). I think that's the wrong question to ask. I would rather ask the question "Is X decoupled enough from Lucene internals that it can be a contrib?" 
Things like IndexWriter, IndexReader, Document and TokenStream really need to be "core" ... but things like the QueryParser, and most of our analyzers, don't. Having lots of loosely coupled mini-libraries that respect good API boundaries seems more reusable and generally saner than "all of this code is useful and lots of people want it, so throw it into the kitchen sink". We don't need to go hog wild gutting things out of the core ... but I don't think we should be adding new things to the core just because they are "generally useful". -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
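As a concrete point of reference for this thread, opting into SweetSpotSimilarity today only takes a few lines of application code. The sketch below is mine, not from the discussion; it targets the 2.3/2.4-era API (some of these calls are deprecated in later releases), and the index path, field name and sample text are made up. SweetSpotSimilarity also exposes tuning methods for the length-norm plateau that are omitted here; check the contrib javadocs before relying on it.

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.misc.SweetSpotSimilarity; // lives in contrib/miscellaneous
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SweetSpotDemo {
  public static void main(String[] args) throws Exception {
    // The same Similarity must be used at index time and at search time.
    SweetSpotSimilarity sim = new SweetSpotSimilarity();

    IndexWriter writer = new IndexWriter("/tmp/sss-demo", new StandardAnalyzer(), true);
    writer.setSimilarity(sim);
    Document doc = new Document();
    doc.add(new Field("body", "a short example document", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher("/tmp/sss-demo");
    searcher.setSimilarity(sim);
    TopDocs td = searcher.search(new TermQuery(new Term("body", "example")), null, 10);
    System.out.println("hits: " + td.totalHits);
    searcher.close();
  }
}
{code}

The point relevant to the defaults discussion is simply that none of this wiring is needed for whatever Similarity ships as the default: DefaultSimilarity is picked up automatically when nothing is set.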
solr2: Onward and Upward
If you've considered Solr in the past, but for some reason it didn't meet your needs, we'd love to hear from you over on solr-dev. We're starting to do some forward looking architecture work on the next major version of Solr, so let us know what ideas you have and what you'd like to see! solr-dev thread: http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html#a19224805 -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Realtime Search for Social Networks Collaboration
Hello all, I don't mean this to sound like a solicitation. I've been working on realtime search and created some Lucene patches etc. I am wondering if there are social networks (or anyone else) out there who would be interested in collaborating with Apache on realtime search to get it to the point it can be used in production. It is a challenging problem that only Google has solved and made to scale. I've been working on the problem for a while and though a lot has been completed, there is still a lot more to do, and collaboration amongst the most probable users (social networks) seems like a good thing to try at this point. I guess I'm saying it seems like a hard enough problem that perhaps it's best to work together on it rather than each company trying to complete its own. However, I could be wrong. Realtime search benefits social networks by providing a scalable, searchable alternative to large MySQL implementations. MySQL, I have heard, is difficult to scale past a certain point. Apparently Google has created things like BigTable (a large database) and an online service called GData (Google has not published any whitepapers on the underlying technology) to address scaling large database systems. BigTable does not offer search. GData does, and is used by all of Google's web services instead of something like MySQL (this is at least how I understand it). Social networks usually grow, and so scaling is continually an issue. It is possible to build a realtime search system that scales linearly, something that I have heard becomes difficult with MySQL. There is an article that discusses some of these issues: http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337 I don't think the current GData implementation is perfect and there is a lot that can be improved on. It might be helpful to figure out together what helpful things can be added. If this sounds like something of interest to anyone, feel free to send your input. Take care, Jason - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar
[ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628106#action_12628106 ] Steven Rowe commented on LUCENE-1126: - Yeah, I see this too. The issue is that the entire Thai range {{\u0e00-\u0e5b}} is included in the unpatched grammar's {LETTER} definition, which contains the huge range {{\u0100-\u1fff}}, much of which is not actually letters. The patched grammar instead substitutes the Unicode 3.0 {{Letter}} general category (via JFlex's [:letter:]), which excludes some characters in the Thai range: non-spacing marks, a currency symbol, numerals, etc. ThaiAnalyzer uses ThaiWordFilter, which uses Java's BreakIterator to tokenize the contiguous text (i.e. without whitespace) provided by StandardTokenizer. The failing test expects to see {{"\u0e17\u0e35\u0e48"}}, but instead gets {{"\u0e17"}}, because {{\u0e35}} is a non-spacing mark, which the patched StandardTokenizer doesn't pass to ThaiWordFilter. Because of this problem, I guess I'm -1 on applying the patch I provided. One solution would be to switch from using the {{Letter}} general category to the derived property {{Alphabetic}}, which includes both general categories {{Letter}} and {{Mark}}. (see Annex C of [the Unicode Regular Expressions Technical Standard|http://www.unicode.org/unicode/reports/tr18/#Compatibility_Properties] under "alpha" for discussion of this). The current version of JFlex does not support Unicode property references in its syntax, though, so simplifying -- and correcting -- the grammar may have to wait for the next version of JFlex, which will support syntax like {{\p{Alphabetic}}}. > Simplify StandardTokenizer JFlex grammar > > > Key: LUCENE-1126 > URL: https://issues.apache.org/jira/browse/LUCENE-1126 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.2 >Reporter: Steven Rowe >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1126.patch > > > Summary of thread entitled "Fullwidth alphanumeric characters, plus a > question on Korean ranges" begun by Daniel Noll on java-user, and carried > over to java-dev: > On 01/07/2008 at 5:06 PM, Daniel Noll wrote: > > I wish the tokeniser could just use Character.isLetter and > > Character.isDigit instead of having to know all the ranges itself, since > > the JRE already has all this information. Character.isLetter does > > return true for CJK characters though, so the ranges would still come in > > handy for determining what kind of letter they are. I don't support > > JFlex has a way to do this... > The DIGIT macro could be replaced by JFlex's predefined character class > [:digit:], which has the same semantics as java.lang.Character.isDigit(). > Although JFlex's predefined character class [:letter:] (same semantics as > java.lang.Character.isLetter()) includes CJK characters, there is a way to > handle this using JFlex's regex negation syntax {{!}}. From [the JFlex > documentation|http://jflex.de/manual.html]: > bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is > !(!{{a}}|{{b}}) > So to exclude CJ characters from the LETTER macro: > {code} > LETTER = ! ( ! [:letter:] | {CJ} ) > {code} > > Since [:letter:] includes all of the Korean ranges, there's no reason > (AFAICT) to treat them separately; unlike Chinese and Japanese characters, > which are individually tokenized, the Korean characters should participate in > the same token boundary rules as all of the other letters. 
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 > supports, and Unicode 5.0, the latest version, and there are lots of new and > modified letter and digit ranges. This stuff gets tweaked all the time, and > I don't think Lucene should be in the business of trying to track it, or take > a position on which Unicode version users' data should conform to. > Switching to using JFlex's [:letter:] and [:digit:] predefined character > classes ties (most of) these decisions to the user's choice of JVM version, > and this seems much more reasonable to me than the current status quo. > I will attach a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
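To make the failure Steven describes concrete: JFlex's [:letter:] has the semantics of Character.isLetter(), which rejects combining marks such as U+0E35. The small check below is my own illustration (not part of the patch or the test suite) and uses only the JDK:

{code:java}
public class ThaiCharCheck {
  public static void main(String[] args) {
    char saraIi = '\u0e35'; // THAI CHARACTER SARA II, a combining vowel sign
    // [:letter:] follows Character.isLetter(), which excludes marks:
    System.out.println(Character.isLetter(saraIi)); // false -> dropped by the patched grammar
    // The character's general category is Mn (non-spacing mark):
    System.out.println(Character.getType(saraIi) == Character.NON_SPACING_MARK); // true
  }
}
{code}

A grammar based on the {{Alphabetic}} derived property would be expected to keep this character, which is why the simplification probably has to wait for {{\p{Alphabetic}}} support in the next JFlex release, as noted above.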
[jira] Reopened: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reopened LUCENE-1320: - Lucene Fields: [Patch Available] (was: [Patch Available, New]) Despite the fact that we allow contribs to be 1.5, I don't think the analysis package should be 1.5, at least it shouldn't be made 1.5 without some discussion on the mailing list. > ShingleMatrixFilter, a three dimensional permutating shingle filter > --- > > Key: LUCENE-1320 > URL: https://issues.apache.org/jira/browse/LUCENE-1320 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.4 > > Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt > > > Backed by a column focused matrix that creates all permutations of shingle > tokens in three dimensions. I.e. it handles multi token synonyms. > Could for instance in some cases be used to replaces 0-slop phrase queries > with something speedier. > {code:java} > Token[][][]{ > {{hello}, {greetings, and, salutations}}, > {{world}, {earth}, {tellus}} > } > {code} > passes the following test with 2-3 grams: > {code:java} > assertNext(ts, "hello_world"); > assertNext(ts, "greetings_and"); > assertNext(ts, "greetings_and_salutations"); > assertNext(ts, "and_salutations"); > assertNext(ts, "and_salutations_world"); > assertNext(ts, "salutations_world"); > assertNext(ts, "hello_earth"); > assertNext(ts, "and_salutations_earth"); > assertNext(ts, "salutations_earth"); > assertNext(ts, "hello_tellus"); > assertNext(ts, "and_salutations_tellus"); > assertNext(ts, "salutations_tellus"); > {code} > Contains more and less complex tests that demonstrate offsets, posincr, > payload boosts calculation and construction of a matrix from a token stream. > The matrix attempts to hog as little memory as possible by seeking no more > than maximumShingleSize columns forward in the stream and clearing up unused > resources (columns and unique token sets). Can still be optimized quite a bit > though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1320: Priority: Blocker (was: Major) I'm marking this as a blocker for 2.4 based on the Java 1.5 incompatibilities that were introduced. > ShingleMatrixFilter, a three dimensional permutating shingle filter > --- > > Key: LUCENE-1320 > URL: https://issues.apache.org/jira/browse/LUCENE-1320 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Blocker > Fix For: 2.4 > > Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt > > > Backed by a column focused matrix that creates all permutations of shingle > tokens in three dimensions. I.e. it handles multi token synonyms. > Could for instance in some cases be used to replaces 0-slop phrase queries > with something speedier. > {code:java} > Token[][][]{ > {{hello}, {greetings, and, salutations}}, > {{world}, {earth}, {tellus}} > } > {code} > passes the following test with 2-3 grams: > {code:java} > assertNext(ts, "hello_world"); > assertNext(ts, "greetings_and"); > assertNext(ts, "greetings_and_salutations"); > assertNext(ts, "and_salutations"); > assertNext(ts, "and_salutations_world"); > assertNext(ts, "salutations_world"); > assertNext(ts, "hello_earth"); > assertNext(ts, "and_salutations_earth"); > assertNext(ts, "salutations_earth"); > assertNext(ts, "hello_tellus"); > assertNext(ts, "and_salutations_tellus"); > assertNext(ts, "salutations_tellus"); > {code} > Contains more and less complex tests that demonstrate offsets, posincr, > payload boosts calculation and construction of a matrix from a token stream. > The matrix attempts to hog as little memory as possible by seeking no more > than maximumShingleSize columns forward in the stream and clearing up unused > resources (columns and unique token sets). Can still be optimized quite a bit > though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
>>Another important driver is the "out-of-the-box experience". >>we need a "standard distro" ...which would be the core plus cherry-pick certain important contrib modules (highlighter, >> SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them together. Is that not Solr, or at least the start of a path that ultimately ends up there? I suspect any attempts at "bundling" Lucene code may snowball until you've rebuilt Solr. If anything, I suspect a more interesting initiative might be to "unbundle" Solr and see some more of its features emerge as standalone modules in Lucene/contrib (or a suitably renamed area e.g. "extensions")? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Realtime Search for Social Networks Collaboration
On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > I am wondering > if there are social networks (or anyone else) out there who would be > interested in collaborating with Apache on realtime search to get it > to the point it can be used in production. Good timing Jason, I think you'll find some other people right here at Apache (solr-dev) that want to collaborate in this area: http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html I've looked at your wiki briefly, and all the high level goals/features seem to really be synergistic with where we are going with Solr2. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
markharw00d wrote: >>Another important driver is the "out-of-the-box experience". >>we need a "standard distro" ...which would be the core plus cherry-pick certain important contrib modules (highlighter, >> SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them together. Is that not Solr, or at least the start of a path that ultimately ends up there? I suspect any attempts at "bundling" Lucene code may snowball until you've rebuilt Solr. Yeah, I guess it is... though Solr includes the whole webapp too, whereas I think there's a natural bundle that wouldn't include that. Still, I think it's important for Lucene itself to have strong defaults out of the box. If anything, I suspect a more interesting initiative might be to "unbundle" Solr and see some more of its features emerge as standalone modules in Lucene/contrib (or a suitably renamed area e.g. "extensions")? I like that! Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
On Wed, Sep 3, 2008 at 4:55 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: >> I suspect any attempts at "bundling" Lucene code may snowball until you've >> rebuilt Solr. > > Yeah I guess it is... though Solr includes the whole webapp too, whereas I > think there's a natural bundle that wouldn't include that. One thing we are looking at for Solr2 is making it more useful for advanced embedded users. I expect a non-webapp version too. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote: Obviously we can't default everything perfectly since at some point there are hard tradeoffs to be made and every app is different, but if SweetSpotSimilarity really gives better relevance for many/most apps, and doesn't have any downsides (I haven't looked closely myself), I think we should get it into core? Well, we only have 2 data points here: Hoss' original position that it was helpful, and Doron's Million Query work. Has anyone else reported benefit? And in that regard, the difference between OOTB and SweetSpot was 0.154 vs. 0.162 for MAP. Not a huge amount, but still useful. Along those lines, there are other length normalization functions (namely approaches that don't favor very short documents as much) that I've seen benefit applications as well, but, as Erik is (in)famous for saying, "it depends". In fact, if we go solely based on the million query work, we'd be better off having the Query Parser create phrase queries automatically for any query w/ more than 1 term (0.19 vs 0.154) before we even touch length normalization. I've long argued that Lucene needs to take on the relevance question more head-on, and in an open source way; until then, we are merely guessing at what's better, w/o empirical evidence that can be easily reproduced. TREC is just one data point, and is often discounted as not being all that useful in the real world. I'm on the fence, though. I agree w/ Hoss that core should be "core" and I don't think we want to throw more and more into core, but I also agree w/ Mike in that we want good, intelligent defaults for what we do have in core. -Grant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628132#action_12628132 ] Karl Wettin commented on LUCENE-1320: - OK. Either remove it or place it in some alternative contrib module? The first choice is obviously the easiest. > ShingleMatrixFilter, a three dimensional permutating shingle filter > --- > > Key: LUCENE-1320 > URL: https://issues.apache.org/jira/browse/LUCENE-1320 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Affects Versions: 2.3.2 >Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Blocker > Fix For: 2.4 > > Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt > > > Backed by a column focused matrix that creates all permutations of shingle > tokens in three dimensions. I.e. it handles multi token synonyms. > Could for instance in some cases be used to replaces 0-slop phrase queries > with something speedier. > {code:java} > Token[][][]{ > {{hello}, {greetings, and, salutations}}, > {{world}, {earth}, {tellus}} > } > {code} > passes the following test with 2-3 grams: > {code:java} > assertNext(ts, "hello_world"); > assertNext(ts, "greetings_and"); > assertNext(ts, "greetings_and_salutations"); > assertNext(ts, "and_salutations"); > assertNext(ts, "and_salutations_world"); > assertNext(ts, "salutations_world"); > assertNext(ts, "hello_earth"); > assertNext(ts, "and_salutations_earth"); > assertNext(ts, "salutations_earth"); > assertNext(ts, "hello_tellus"); > assertNext(ts, "and_salutations_tellus"); > assertNext(ts, "salutations_tellus"); > {code} > Contains more and less complex tests that demonstrate offsets, posincr, > payload boosts calculation and construction of a matrix from a token stream. > The matrix attempts to hog as little memory as possible by seeking no more > than maximumShingleSize columns forward in the stream and clearing up unused > resources (columns and unique token sets). Can still be optimized quite a bit > though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628154#action_12628154 ] Michael McCandless commented on LUCENE-1131: Otis is this one ready to go in? > Add numDeletedDocs to IndexReader > - > > Key: LUCENE-1131 > URL: https://issues.apache.org/jira/browse/LUCENE-1131 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Shai Erera >Assignee: Otis Gospodnetic >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1131.patch > > > Add numDeletedDocs to IndexReader. Basically, the implementation is as simple > as doing: > public int numDeletedDocs() { > return deletedDocs == null ? 0 : deletedDocs.count(); > } > in SegmentReader. > Patch to follow to include in all IndexReader extensions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token
[ https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1350. Resolution: Duplicate Fix Version/s: (was: 2.3.3) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Isn't this one now a dup of LUCENE-1333? > Filters which are "consumers" should not reset the payload or flags and > should better reuse the token > - > > Key: LUCENE-1350 > URL: https://issues.apache.org/jira/browse/LUCENE-1350 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis, contrib/* >Reporter: Doron Cohen >Assignee: Doron Cohen > Fix For: 2.4 > > Attachments: LUCENE-1350-test.patch, LUCENE-1350.patch, > LUCENE-1350.patch > > > Passing tokens with payloads through SnowballFilter results in tokens with no > payloads. > A workaround for this is to apply stemming first and only then run whatever > logic creates the payload, but this is not always convenient. > Other "consumer" filters have similar problem. > These filters can - and should - reuse the token, by implementing > next(Token), effectively also fixing the unwanted resetting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token
[ https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628158#action_12628158 ] Doron Cohen commented on LUCENE-1350: - Yes it is a dup, thanks Mike for taking care of this (I planned to do this yesterday but didn't make it) > Filters which are "consumers" should not reset the payload or flags and > should better reuse the token > - > > Key: LUCENE-1350 > URL: https://issues.apache.org/jira/browse/LUCENE-1350 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis, contrib/* >Reporter: Doron Cohen >Assignee: Doron Cohen > Fix For: 2.4 > > Attachments: LUCENE-1350-test.patch, LUCENE-1350.patch, > LUCENE-1350.patch > > > Passing tokens with payloads through SnowballFilter results in tokens with no > payloads. > A workaround for this is to apply stemming first and only then run whatever > logic creates the payload, but this is not always convenient. > Other "consumer" filters have similar problem. > These filters can - and should - reuse the token, by implementing > next(Token), effectively also fixing the unwanted resetting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1356) Allow easy extensions of TopDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628163#action_12628163 ] Michael McCandless commented on LUCENE-1356: Doron is this one ready to go in? > Allow easy extensions of TopDocCollector > > > Key: LUCENE-1356 > URL: https://issues.apache.org/jira/browse/LUCENE-1356 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3.3, 2.4 > > Attachments: 1356-2.patch, 1356.patch > > > TopDocCollector's members and constructor are declared either private or > package visible. It makes it hard to extend it as if you want to extend it > you can reuse its *hq* and *totatlHits* members, but need to define your own. > It also forces you to override getTotalHits() and topDocs(). > By changing its members and constructor (the one that accepts a PQ) to > protected, we allow users to extend it in order to get a different view of > 'top docs' (like TopFieldCollector does), but still enjoy its getTotalHits() > and topDocs() method implementations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Moving SweetSpotSimilarity out of contrib
My thought was to move SSS to core as a step towards making it the default, if and when there is more evidence that it is better than the current default - it just felt right as a cautious step - I mean, first move it to core so that it is more exposed and used, and only after a while, maybe, if the evidence is mostly positive, make it the default. On Thu, Sep 4, 2008 at 12:04 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote: > >> >> Obviously we can't default everything perfectly since at some point >> there are hard tradeoffs to be made and every app is different, but if >> SweetSpotSimilarity really gives better relevance for many/most apps, >> and doesn't have any downsides (I haven't looked closely myself), I >> think we should get it into core? >> > > Well, we only have 2 data points here: Hoss' original position that it was > helpful, and Doron's Million Query work. Has anyone else reported benefit? > And in that regard, the difference between OOTB and SweetSpot was 0.154 vs. > 0.162 for MAP. Not a huge amount, but still useful. In that regard, there > are other length normalization functions (namely approaches that don't favor > very short documents as much) that I've seen benefit applications as well, > but as Erik is (in)famous for saying "it depends". In fact, if we go solely > based on the million query work, we'd be better off having the Query Parser > create phrase queries automatically for any query w/ more than 1 term (0.19 > vs 0.154) before we even touch length normalization. > > I've long argued that Lucene needs to take on the relevance question more > head on, and in an open source way, until then, we are merely guessing at > what's better, w/o empirical evidence that can be easily reproduced. TREC > is just one data point, and is often discounted as being all that useful in > the real world. > > I'm on the fence, though. I agree w/ Hoss that core should be "core" and I > don't think we want to throw more and more into core, but I also agree w/ > Mike in that we want good, intelligent defaults for what we do have in core. > > -Grant > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Realtime Search for Social Networks Collaboration
Hi Yonik, The SOLR 2 list looks good. The question is, who is going to do the work? I tried to simplify the scope of Ocean as much as possible to make it possible (and slowly at that over time) for me to eventually finish what is mentioned on the wiki. I think SOLR is very cool and was a major step forward when it came out. I also think it's got a lot of things now which make integration difficult to do properly. I did try to integrate and received a lukewarm response, and so decided to just move ahead separately until folks have time to collaborate. We probably should try to integrate SOLR and Ocean somehow; however, we may want to simply reduce the scope a bit and figure out what is needed most, with the main use case being social networks. I think the problem with integrating with SOLR is that it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. Facets were important; realtime was not needed because pricing doesn't change very often. I designed Ocean for social networks and, further into the future, realtime messaging-based mobile applications. SOLR needs to be backward compatible and support its existing user base. How do you plan on doing this for SOLR 2 if the architecture is changed dramatically? SOLR solves a problem set that is very common, making it very useful in many situations. However, I wanted Ocean to be like GData. So I wanted the scalability of Google, which SOLR doesn't quite have yet, and the realtime part, and then I figured the other stuff could be added later - stuff people seem to spend a lot of time on in the SOLR community currently (spellchecker, db imports, many others). I did use some of the SOLR terminology in building Ocean, like snapshots! But most of it is a digression. I tried to use schemas, but they just make the system harder to use. For distributed search I prefer serialized objects, as this enables things like SpanQueries and payloads without writing request handlers and such. Also, there is no need to write new request handlers and deploy them (an expensive operation for systems that run on hundreds of servers), as any new classes are simply dynamically loaded by the server from the client. A lot is now outlined on the wiki at http://wiki.apache.org/lucene-java/OceanRealtimeSearch and there will be a lot more javadocs in the forthcoming patch. The latest code is also available all the time at http://oceansearch.googlecode.com/svn/trunk/trunk/oceanlucene and I do welcome more discussion; if there are Solr developers who wish to work on Ocean, feel free to drop me a line. Most of all, though, I think it would be useful for social networks interested in realtime search to get involved, as it may be something that is difficult for one company to implement to a production level on its own. I think this is where open source collaboration is particularly useful. Cheers, Jason Rutherglen [EMAIL PROTECTED] On Wed, Sep 3, 2008 at 4:56 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen > <[EMAIL PROTECTED]> wrote: >> I am wondering >> if there are social networks (or anyone else) out there who would be >> interested in collaborating with Apache on realtime search to get it >> to the point it can be used in production. 
> > Good timing Jason, I think you'll find some other people right here > at Apache (solr-dev) that want to collaborate in this area: > > http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html > > I've looked at your wiki briefly, and all the high level goals/features seem > to really be synergistic with where we are going with Solr2. > > -Yonik > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1356) Allow easy extensions of TopDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628193#action_12628193 ] Doron Cohen commented on LUCENE-1356: - It is, applies cleanly and seems correct. Will commit as soon as tests complete. > Allow easy extensions of TopDocCollector > > > Key: LUCENE-1356 > URL: https://issues.apache.org/jira/browse/LUCENE-1356 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3.3, 2.4 > > Attachments: 1356-2.patch, 1356.patch > > > TopDocCollector's members and constructor are declared either private or > package visible. It makes it hard to extend it as if you want to extend it > you can reuse its *hq* and *totatlHits* members, but need to define your own. > It also forces you to override getTotalHits() and topDocs(). > By changing its members and constructor (the one that accepts a PQ) to > protected, we allow users to extend it in order to get a different view of > 'top docs' (like TopFieldCollector does), but still enjoy its getTotalHits() > and topDocs() method implementations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1356) Allow easy extensions of TopDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-1356. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Thanks Shai ! > Allow easy extensions of TopDocCollector > > > Key: LUCENE-1356 > URL: https://issues.apache.org/jira/browse/LUCENE-1356 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.3.3, 2.4 > > Attachments: 1356-2.patch, 1356.patch > > > TopDocCollector's members and constructor are declared either private or > package visible. It makes it hard to extend it as if you want to extend it > you can reuse its *hq* and *totatlHits* members, but need to define your own. > It also forces you to override getTotalHits() and topDocs(). > By changing its members and constructor (the one that accepts a PQ) to > protected, we allow users to extend it in order to get a different view of > 'top docs' (like TopFieldCollector does), but still enjoy its getTotalHits() > and topDocs() method implementations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-989) Statistics from ValueSourceQuery
[ https://issues.apache.org/jira/browse/LUCENE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-989: --- Fix Version/s: (was: 2.4) 3.0 Assignee: (was: Doron Cohen) This should be looked at with LUCENE-1085 - removing myself so as not to block others who can do it sooner. > Statistics from ValueSourceQuery > - > > Key: LUCENE-989 > URL: https://issues.apache.org/jira/browse/LUCENE-989 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.2 >Reporter: Will Johnson >Priority: Minor > Fix For: 3.0 > > Attachments: functionStats.patch > > > Patch forthcoming that adds a DocValuesStats object that is optionally > computed for a ValueSourceQuery. This ~replaces the getMin/Max/Avg from the > DocValues which were previously inaccessible without reasonably heavy > subclassing. In addition it adds a few more stats and provides a single > object to encapsulate all statistics going forward. The stats object is tied > to the ValueSourceQuery so that the values can be cached without having to > maintain the full set of DocValues. Test and javadocs included. > - will -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1081) Remove the "Experimental" warnings from search.function package
[ https://issues.apache.org/jira/browse/LUCENE-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1081: Fix Version/s: (was: 2.4) 3.0 Assignee: (was: Doron Cohen) Will depend on LUCENE-1085. > Remove the "Experimental" warnings from search.function package > --- > > Key: LUCENE-1081 > URL: https://issues.apache.org/jira/browse/LUCENE-1081 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4 >Reporter: Doron Cohen >Priority: Minor > Fix For: 3.0 > > > I am using this package for a while, seems that others in this list use it > too, so let's remove those warnings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1085) search.function should support all capabilities of Solr's search.function
[ https://issues.apache.org/jira/browse/LUCENE-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1085: Fix Version/s: (was: 2.4) 3.0 Assignee: (was: Doron Cohen) > search.function should support all capabilities of Solr's search.function > - > > Key: LUCENE-1085 > URL: https://issues.apache.org/jira/browse/LUCENE-1085 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Doron Cohen >Priority: Minor > Fix For: 3.0 > > > Lucene search.function does not allow Solr to move to use it, and so Solr > currently maintains its own version of this package. > Enhance Lucene's search.function so that Solr can move to use it, and avoid > this redundancy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Issue Comment Edited: (LUCENE-989) Statistics from ValueSourceQuery
[ https://issues.apache.org/jira/browse/LUCENE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628197#action_12628197 ] doronc edited comment on LUCENE-989 at 9/3/08 4:38 PM: This should be looked at with LUCENE-1085 - removing myself so as not to block others who can do it sooner. was (Author: doronc): This should be look at with LUCENE-1085 - removing myself to not so others who can do it sooner. > Statistics from ValueSourceQuery > - > > Key: LUCENE-989 > URL: https://issues.apache.org/jira/browse/LUCENE-989 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.2 >Reporter: Will Johnson >Priority: Minor > Fix For: 3.0 > > Attachments: functionStats.patch > > > Patch forthcoming that adds a DocValuesStats object that is optionally > computed for a ValueSourceQuery. This ~replaces the getMin/Max/Avg from the > DocValues which were previously inaccessible without reasonably heavy > subclassing. In addition it adds a few more stats and provides a single > object to encapsulate all statistics going forward. The stats object is tied > to the ValueSourceQuery so that the values can be cached without having to > maintain the full set of DocValues. Test and javadocs included. > - will -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
Grant Ingersoll (JIRA) wrote: Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one. This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe. Grant, not sure what you mean by "b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one." Once we set replaceInvalidAcronym=true, then the type is set to HOST. However, if you were to revisit the grammar, then I would be interested to get in on the discussion on the behaviour of HOST tokens. For instance, if you have a document like "visit www.apache.org", you currently won't get a hit if you search for "apache". In an issue tracker like JIRA, we want to be able to search for "NullPointerException", and get a hit for the document "Application threw java.lang.NullPointerException". Also note that the current implementation has problems if the document doesn't contain expected whitespace, e.g. "I like Apache.They rock" will get tokenized to the following tokens: "I", "like", "Apache.They", "rock". I don't think there is a simple one-size-fits-all answer to how this should behave. It depends on the context of the app that is using Lucene. The best answer may be to make some of the behaviour configurable, or to have a suite of specific analyzers? Mark. Most of the contributed Analyzers suffer from invalid recognition of acronyms. -- Key: LUCENE-1373 URL: https://issues.apache.org/jira/browse/LUCENE-1373 Project: Lucene - Java Issue Type: Bug Components: Analysis, contrib/analyzers Affects Versions: 2.3.2 Reporter: Mark Lassau Priority: Minor LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end). Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us. StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy. Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :( I refer to: * BrazilianAnalyzer * CzechAnalyzer * DutchAnalyzer * FrenchAnalyzer * GermanAnalyzer * GreekAnalyzer * ThaiAnalyzer - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
Or just remove the generics, right? On Sep 3, 2008, at 5:09 PM, Karl Wettin (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628132 #action_12628132 ] Karl Wettin commented on LUCENE-1320: - OK. Either remove it or place it in some alternative contrib module? The first chooise is obviously the easiest. ShingleMatrixFilter, a three dimensional permutating shingle filter --- Key: LUCENE-1320 URL: https://issues.apache.org/jira/browse/LUCENE-1320 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Affects Versions: 2.3.2 Reporter: Karl Wettin Assignee: Karl Wettin Priority: Blocker Fix For: 2.4 Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt Backed by a column focused matrix that creates all permutations of shingle tokens in three dimensions. I.e. it handles multi token synonyms. Could for instance in some cases be used to replaces 0-slop phrase queries with something speedier. {code:java} Token[][][]{ {{hello}, {greetings, and, salutations}}, {{world}, {earth}, {tellus}} } {code} passes the following test with 2-3 grams: {code:java} assertNext(ts, "hello_world"); assertNext(ts, "greetings_and"); assertNext(ts, "greetings_and_salutations"); assertNext(ts, "and_salutations"); assertNext(ts, "and_salutations_world"); assertNext(ts, "salutations_world"); assertNext(ts, "hello_earth"); assertNext(ts, "and_salutations_earth"); assertNext(ts, "salutations_earth"); assertNext(ts, "hello_tellus"); assertNext(ts, "and_salutations_tellus"); assertNext(ts, "salutations_tellus"); {code} Contains more and less complex tests that demonstrate offsets, posincr, payload boosts calculation and construction of a matrix from a token stream. The matrix attempts to hog as little memory as possible by seeking no more than maximumShingleSize columns forward in the stream and clearing up unused resources (columns and unique token sets). Can still be optimized quite a bit though. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
I think we should distinguish between what is a bug and what is an attempt by the tokenizer to produce a meaningful token. When the tokenizer outputs a HOST or ACRONYM token type, there's nothing that prevents you from putting a filter after the tokenizer that will use a UIMA Annotator (for example) and verify that the output token type is indeed correct. For example, in the case of java.lang.NullPointerException we all understand it's not a HOST, but unfortunately our logic hasn't been translated well into computer instructions, yet :-). However you treat this token now is up to you: - If you want to be able to search for the individual parts of the host, but still find the full host, I'd put a TokenFilter after the tokenizer that breaks the HOST into its parts and returns the parts along with the full host name (a rough sketch of such a filter follows this message). During query time I'd then remove that filter (i.e. create an Analyzer w/o that filter) and thus I'd be able to search for either "apache" or "www.apache.org". - If you want to actually verify the output HOST is indeed a host, again, put a TokenFilter after the tokenizer and either apply your own simple heuristics (for example, if there's a ".com", ".org", ".net" it's a HOST, otherwise it's not - I know these don't cover all HOST types, it's just an example), or validate that with an external tool, like a UIMA Annotator. - You can also decide that a two-part HOST is not really a host; that way you solve the "I like Apache.They rock" problem, but miss a whole handful of hosts like "ibm.com", "apache.org", "google.com". Again, IMO, the logic in the tokenizer today for HOSTs and ACRONYMs is a "best effort" to produce a meaningful token. If we remove those rules, it'd be impossible to detect them, because the tokenizer is set to discard any stand-alone "&", ".", or "@". I'm going to send out another email to the list about a bug or inconsistency I recently found in the COMPANY rule. I don't want to mix this thread with a different issue. On Thu, Sep 4, 2008 at 5:17 AM, Mark Lassau <[EMAIL PROTECTED]> wrote: > Grant Ingersoll (JIRA) wrote: > >> Of course, it's still a bit weird, b/c in your case the type value is >> going to be set to ACRONYM, when your example is clearly not one. This >> suggests to me that the grammar needs to be revisited, but that can wait >> until 3.0 I believe. >> >> >> > Grant, not sure what you mean by "b/c in your case the type value is going > to be set to ACRONYM, when your example is clearly not one." > Once we set replaceInvalidAcronym=true, then the type is set to HOST. > > However, if you were to revisit the grammar, then I would be interested to > get in on the discussion on the behaviour of HOST tokens. > For instance, if you have a document like "visit www.apache.org", you > currently won't get a hit if you search for "apache". > In an issue tracker like JIRA, we want to be able to search for > "NullPointerException", and get a hit for the document "Application threw > java.lang.NullPointerException". > > Also note that the current implementation has problems if the document > doesn't contain expected whitespace. > eg "I like Apache.They rock" > Will get tokenized to the following: > I > like > Apache.They > rock > > I don't think there is a simple one-size-fits-all answer to how this should > behave. It depends on the context of the app that is using Lucene. > The best answer may be to make some of the behaviour configurable, or have > a suite of specific analyzers? > > Mark. 
> >> Most of the contributed Analyzers suffer from invalid recognition of >>> acronyms. >>> >>> -- >>> >>>Key: LUCENE-1373 >>>URL: https://issues.apache.org/jira/browse/LUCENE-1373 >>>Project: Lucene - Java >>> Issue Type: Bug >>> Components: Analysis, contrib/analyzers >>> Affects Versions: 2.3.2 >>> Reporter: Mark Lassau >>> Priority: Minor >>> >>> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like " >>> www.apache.org." would be incorrectly tokenized as an acronym (note the >>> dot at the end). >>> Unfortunately, keeping the "backward compatibility" of a bug turns out to >>> harm us. >>> StandardTokenizer has a couple of ways to indicate "fix this bug", but >>> unfortunately the default behaviour is still to be buggy. >>> Most of the non-English analyzers provided in lucene-analyzers utilize >>> the StandardTokenizer, and in v2.3.2 not one of these provides a way to get >>> the non-buggy behaviour :( >>> I refer to: >>> * BrazilianAnalyzer >>> * CzechAnalyzer >>> * DutchAnalyzer >>> * FrenchAnalyzer >>> * GermanAnalyzer >>> * GreekAnalyzer >>> * ThaiAnalyzer >>> >>> >> >> >> > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PR
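As referenced in the first option above, here is a rough sketch of such a part-splitting filter. It is not an existing Lucene class: the name is made up, it uses the old String-based Token API of the 2.3/2.4 line for brevity, and a real implementation would want to be more careful about offsets, position increments and type strings. The same idea applied to dotted class names would let "NullPointerException" match "java.lang.NullPointerException".

{code:java}
import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Re-emits each <HOST> token followed by its dot-separated parts (hypothetical example). */
public class HostPartsFilter extends TokenFilter {
  private final LinkedList pending = new LinkedList();

  public HostPartsFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (!pending.isEmpty()) {
      return (Token) pending.removeFirst();
    }
    Token token = input.next();
    if (token == null || !"<HOST>".equals(token.type())) {
      return token; // pass everything else through untouched
    }
    String text = token.termText();
    int start = token.startOffset();
    String[] parts = text.split("\\.");
    for (int i = 0; i < parts.length; i++) {
      Token part = new Token(parts[i], start, start + parts[i].length());
      part.setPositionIncrement(0); // stack the parts on the same position as the host
      pending.add(part);
      start += parts[i].length() + 1; // skip past the '.'
    }
    return token; // emit the full host first, then its parts
  }
}
{code}

At query time you would typically analyze without this filter (or keep it, if queries on "www.apache.org" should also expand), which mirrors the "remove that filter at query time" suggestion above.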
Is the COMPANY rule in StandardTokenizer valid?
Hi. The COMPANY rule in StandardTokenizer is defined like this: // Company names like AT&T and [EMAIL PROTECTED] COMPANY= {ALPHA} ("&"|"@") {ALPHA} While this works perfectly for AT&T and [EMAIL PROTECTED], it doesn't work well for strings like widget&javascript&html. Now, the latter is obviously wrongly typed, and should have been separated by spaces, but that's what a user typed in a document, and now we need to treat it right (why don't they understand the rules of IR and tokenization?). Normally I wouldn't care and say this is one of the extreme cases, but unfortunately the tokenizer outputs two tokens: widget&javascript and html. Now that bothers me - the user can search for "html" and find the document, but not "javascript" or "widget", which is a bit harder to explain to users, even the intelligent ones. That got me thinking about whether this rule is properly defined, and what its purpose is. Obviously it's an attempt to avoid breaking legitimate company names on "&" and "@", but I'm not sure it covers all company name formats. For example, AT&T can be written as "AT & T" (with spaces) and I've also seen cases where it's written as ATT. While you could say "it's a best effort case", users don't buy that. Either you do something properly (it doesn't have to be 100% accurate, though), or you don't do it at all (I hope that doesn't sound too harsh). That way it's easy to explain to your users that you simply break on "&" or "@" (unless it's an email). They may not like it, but you'll at least be consistent. This rule slows down StandardTokenizer's tokenization, and in the end does not produce consistent results. If we think it's important to detect these tokens, then let's at least make it consistent by either: - changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing both "AT&T" and "widget&javascript&html" as COMPANY. That at least will allow developers to put a CompanyTokenFilter (for example) after the tokenizer to break on "&" and "@" whenever there are more than two parts. We could also modify StandardFilter (which already handles ACRONYM) to handle COMPANY that way. - changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so that we recognize company names only if the pattern is followed by a space, dot, dash, underscore, exclamation mark or question mark. That'll still recognize AT&T, but won't recognize widget&javascript&html as COMPANY (which is good). What do you think? Shai
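For anyone who wants to reproduce the behaviour described above before weighing in, a quick check along these lines prints each token with its type. This is my own sketch (again using the era's deprecated String-based Token API); per Shai's report it should show "widget&javascript" with type <COMPANY> and "html" with type <ALPHANUM>:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CompanyRuleCheck {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    TokenStream ts = analyzer.tokenStream("f", new StringReader("widget&javascript&html"));
    // Print term text and token type for each token produced by StandardAnalyzer.
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println(t.termText() + "\t" + t.type());
    }
  }
}
{code}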